Abstract

We consider the problem of reconstructing a discrete-time signal (sequence) with continuous-valued components
corrupted by a known memoryless channel. When performance is measured using a per-symbol loss function satisfying
mild regularity conditions, we develop a sequence of denoisers that, although independent of the distribution of
the underlying 'clean' sequence, is universally optimal in the limit of large sequence length. This sequence of
denoisers is universal in the sense of performing as well as any sliding window denoising scheme which may be
optimized for the underlying clean signal. Our results are initially developed in a "semi-stochastic" setting, where the
noiseless signal is an unknown individual sequence, and the only source of randomness is due to the channel noise.
It is subsequently shown that in the fully stochastic setting, where the noiseless sequence is a stationary stochastic
process, our schemes universally attain optimum performance. The proposed schemes draw from nonparametric
density estimation techniques and are practically implementable. We demonstrate the efficacy of the proposed schemes
in denoising gray-scale images in the conventional additive white Gaussian noise setting, with additional promising
results for less conventional noise distributions.

I. Introduction

Consider the problem of estimating a clean discrete-time signal (sequence) $\{X_t\}_{t \in T}$, $X_t \in [a,b] \subset \mathbb{R}$, based on
its noisy observations $\{Z_t\}_{t \in T}$, $Z_t \in \mathbb{R}$, where $\{Z_t\}$ is the output of a corruption mechanism, a memoryless channel.
This problem finds applications in areas ranging from engineering, cryptography and statistics, to bioinformatics 
and beyond. There is significant literature on particular instantiations of this problem, most notably for the case 
where signal and noise components are real-valued and the noise is additive, most commonly Gaussian (cf. [9]
and references therein). Solutions to this problem in [9] are based on wavelet-based soft thresholding and have
various asymptotic optimality properties under a minimax criterion. The scope of wavelet-based thresholding in 
[9] has been extended beyond the additive white Gaussian case in [13], [1] where optimality is again established 
in an asymptotic minimax sense. The soft-thresholding scheme proposed in [1] is among the few denoisers found 
in the literature [13], [21] that are designed for the case of a non-Gaussian corruption mechanism. Even in this 
case, restrictions to additive noise and symmetry assumptions on the noise distribution are made in order to provide 
asymptotic performance guarantees. For the case of a random vector $Y = X + Z$, where $X$ is independent of $Z$
(with known distribution), the minimum mean squared error (MMSE) estimate of $X$ is well known to be given by
$\hat{X} = \psi(Y) = E\{X \mid Y\}$. It was shown in [27] that, for $Z \sim \mathcal{N}(\mu, \Sigma)$, $\psi(\cdot)$ satisfies
$\psi(Y) = (Y - \mu) + \Sigma \nabla_y \ln f_Y(Y)$, where $f_Y(y)$ is the marginal density of $Y$, which can be learned from
the noisy samples $Y^n = \{Y_1, \ldots, Y_n\}$ of $Y$. Using techniques for nonparametric density estimation in [7], an
estimate $\hat{f}_Y(y)$ of $f_Y(y)$ can be computed, the (appropriate) gradient of which leads to the following estimate:

$$\hat{\psi}(Y) = (Y - \mu) + \Sigma \, \frac{\nabla_y \hat{f}_Y(Y)}{\hat{f}_Y(Y)} \tag{1}$$
The authors in [27] also discuss expressions for $\psi(Y)$ for a certain class of non-Gaussian noise distributions, with the
corruption mechanism continuing to be additive. This leaves room for universal denoising schemes for continuous-valued
data for a general class of noise distributions where the corruption mechanism is also arbitrary. Compression-based
approaches (cf., e.g., [25] and [10]), as discussed in [36], are provably sub-optimal and
suffer from the impracticality of implementing optimal lossy compression schemes. The wavelet-based Bayesian
estimation approach in [26] has demonstrated significant improvement in image denoising. However, despite much
recent progress, universal denoising for discrete-time continuous-amplitude data is still a largely
open problem of both theoretical and practical value. The problem is particularly relevant in emerging areas such as
microarray imaging [35], array-based comparative genomic hybridization (array-CGH) [19] and medical imaging
[34], [17], [22], where parametric noise models that are currently used often fail to capture the true nature of the
noise.
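As an illustration of the plug-in estimate in (1), the following minimal Python sketch treats the scalar case, with the marginal density replaced by a Gaussian-kernel density estimate whose derivative is available in closed form. The function names, the choice of a Gaussian kernel and the bandwidth `h` are assumptions made here for illustration; they are not taken from [27].

```python
import math


def kde_and_derivative(y, samples, h):
    """Gaussian-kernel estimate of f_Y(y) and of its derivative f_Y'(y)."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    f = 0.0
    df = 0.0
    for s in samples:
        u = (y - s) / h
        w = math.exp(-0.5 * u * u)
        f += w
        df += -(u / h) * w  # derivative of the Gaussian kernel w.r.t. y
    return norm * f, norm * df


def mmse_estimate(y, samples, h, mu, sigma2):
    """Scalar version of (1): (y - mu) + sigma^2 * f_hat'(y) / f_hat(y)."""
    f, df = kde_and_derivative(y, samples, h)
    return (y - mu) + sigma2 * df / f
```

With a single noisy sample at 0, bandwidth 1 and noise variance 0.5, the estimate at `y = 1` shrinks the observation halfway towards the sample, which is the qualitative behavior one expects from (1).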

Recently, universal denoising for discrete signals and channels was considered in [36]. The results of [36], and 
the denoising scheme DUDE proposed therein, although attractive theoretically, are restricted in their practicality 
to problems with small alphabets. This is a result of 

• computational issues involved with collecting higher-order joint distributions from the noisy data. 

• mapping an estimated channel output distribution to an estimated channel input distribution. 

• count statistics being too sparse to be reliable for even moderately large alphabet sizes. 

This leaves open challenges in the application of DUDE to problems like gray-scale image denoising. More recently, 
a modified DUDE, using ideas from lossless compression, was presented in [24]. As discussed in that work, in spite 
of circumventing some of the computational issues mentioned above, the approach leaves room for improvement 
in the denoising performance. The problem was further extended to the discrete-valued input and general output 
alphabet setting in [5]. This approach proposes quantization of the output alphabet space and proceeds along
lines similar to those in [36], showing that there is no essential loss of optimality in quantizing the channel output
before denoising (insofar as learning the statistics of the underlying data is concerned). In spite of its theoretical 
elegance, this approach faces similar issues as the scheme of [36], limiting its scope of applications to small channel 
input alphabets. The authors of [5], while conjecturing the need for mild restrictions on the channel, suggest an
extension of the proposed scheme to the case where both the input and output alphabet spaces are continuous-valued
and general. The present work proposes an extension of the two-stage DUDE-like approach in [36], [5] to the case
of denoising for general alphabets. A natural extension would have been to quantize both the input and the output 
space and apply a similar count-statistic based two-pass approach. The vast literature on nonparametric density 
estimation (cf. [7] and references therein), however, points to the opportunity of extracting more reliable statistics 
from the observed data, that would lead to better denoising (as measured under a specified loss function). We 
do, however, maintain the sliding window approach of [5], [36] and show asymptotic universal optimality of our 
schemes with increasing context lengths in the limit of large sequence lengths. 

Recent developments in universal denoising in the particular context of images have also been reported in [4]. 
Their approach is based on local smoothing methods that make assumptions on the underlying structure of the data 
which are more relevant in image denoising due to the inherent redundancy of natural images. The consistency 
results showed the convergence of the denoising rule to the conditional expected value of the clean symbol given 
the noisy neighborhood sans the particular noisy symbol in question. There is potential to improve this result by 
incorporating the information from the noisy pixel that is being denoised too, an approach at the heart of the 
denoisers we present below. We establish the universal optimality of the suggested denoisers in a generality that 
applies to arbitrarily distributed noiseless signals, arbitrary memoryless channels, and arbitrary loss functions (with 
some benign regularity conditions). 

The remainder of the paper is organized as follows. In section II, we discuss the problem setup and notations. 
This is followed by a description of the technical results that are key to the construction of the denoisers in section 
III. In section IV, we establish universality of a family of denoisers that we develop for the semi-stochastic setting, 
in which the clean data is an individual sequence and provide bounds on the difference between the performance 
of this proposed family of denoisers and that of the best 'symbol-by-symbol' denoiser chosen by a genie with full 
knowledge of the distribution (or probability law) of the clean data. Section V details an extension of this proposed 
family of denoisers to a genie that can select the best sliding window scheme, of any order, with knowledge of the 
underlying clean data. Section VI discusses the implication of the performance guarantees in the semi-stochastic 
setting to the fully stochastic setting where the clean data is generated by a stationary stochastic process, rather 
than an individual sequence. A slightly modified version of the proposed denoiser is shown to reduce to the scheme 
of [5] when the underlying clean data have finite alphabet size. The proposed family of denoisers can, hence, be 
seen as a natural extension of those in [5] to the current setting of denoising continuous valued symbols corrupted 
by a continuous memoryless channel where the clean data components may take values in a continuum. In section
VII we present some preliminary experimental results of applying the proposed schemes to denoising of gray-scale
images. We conclude in section VIII with a summary of some propositions for future research directions.



Throughout this paper, we maintain the flow by stating the theorems and lemmas corresponding to the optimality
results in the main body of the paper, relegating most of the proofs to the appendices.

II. Problem Setting and Notations 
Let $x = (x_1, x_2, \ldots)$ be an individual (deterministic) noise-free source signal$^1$ with components taking values in
$[a,b] \subset \mathbb{R}$ and $Y = (Y_1, Y_2, \ldots)$, $Y_i \in \mathbb{R}$, be the corresponding noisy observations, also referred to as the 'output of
the channel' (corruption source). This setting, where both the underlying clean sequence and the noisy sequence are
continuous valued, is the continuous-amplitude analog of the semi-stochastic setting discussed in [5]. The channel
is specified by a family of distribution functions $\mathcal{C} = \{F_{Y|x}\}_{x \in [a,b]}$, where $F_{Y|x}$ denotes the distribution of the
channel output symbol when the input symbol is $x$. Also, we denote the probability measure on $\mathbb{R}$ corresponding
to $F_{Y|x}$ by $\mu_x$. We make the following assumptions about the channel:

C1. The channel is memoryless, which is to say that the components of $Y$ are independent with $Y_i \sim F_{Y|x_i}$.

C2. The family of measures $\{\mu_x\}_{x \in [a,b]}$ associated with the channel $\mathcal{C}$ is uniformly tight in the sense that

$$\sup_{x \in [a,b]} \mu_x\left([-T, T]^c\right) \to 0 \quad \text{as } T \to \infty.$$

This condition will be needed to guarantee that one can consistently track the evolution of the marginal
density of the noisy symbols at the output of the memoryless channel, regardless of the underlying $x$, using
nonparametric kernel density estimation techniques.

C3. The distribution functions $F_{Y|x}$ are absolutely continuous w.r.t. the Lebesgue measure for all $x \in [a,b]$, and
$\{f_{Y|x}\}$ denotes the corresponding densities. This assumption is not crucial for the validity of our approach
but is made for concreteness in the construction of our schemes and the development of their performance
guarantees.

C4. The conditional densities of the channel form a set of linearly independent functions. This is equivalent to
the "invertibility" condition of [36], which ensures that to any distribution on the input to the channel there
corresponds a unique channel output distribution.

C5. The mapping from the space of channel input distributions to the corresponding channel output distributions
is continuous w.r.t. a metric that will be detailed in Section III. The precise analytical expression describing
this condition is discussed in the Appendix.

C6. The expected loss, for reasonably well-behaved loss functions (conditions L1-L2 listed subsequently in this
section), is continuous in the channel output distribution: two output distributions that are close (under the
metric discussed in Section III) induce expected losses that are close. Again, the analytical expression
describing this condition is in the Appendix.
The above are rather benign conditions obeyed by most channels arising in practice, an example being
the most commonly addressed channel, viz., the additive white Gaussian noise (AWGN) channel. It is easy to
verify that even the multiplicative (non-additive) Gaussian channel with finite variance and mean satisfies these
requirements. In this case, the channel input (underlying clean signal) affects the variance of the channel. The fact
that the underlying clean signal takes only bounded values implies that the tightness condition, C2, is satisfied. In
fact, any additive noise channel with distribution functions that are absolutely continuous and the corresponding
densities (of finite mean and variance) satisfying conditions C4-C7 (C7 is discussed in the Appendix) will satisfy the
above requirements.

$^1$ Throughout the paper we use the terms 'signal' and 'sequence' interchangeably.
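As a worked check of the tightness condition C2, consider the AWGN channel $Y = x + N$ with $N \sim \mathcal{N}(0, \sigma^2)$. The symbols $N$, $\sigma$, $M$ and the Gaussian tail function $Q$ are introduced here only for this sketch. With $M = \max(|a|, |b|)$ and any $T > M$, the event $|x + N| > T$ forces $|N| > T - |x| \ge T - M$, so that

$$\sup_{x \in [a,b]} \mu_x\left([-T,T]^c\right) = \sup_{x \in [a,b]} \Pr\left(|x + N| > T\right) \le \Pr\left(|N| > T - M\right) = 2\,Q\!\left(\frac{T - M}{\sigma}\right) \to 0 \quad \text{as } T \to \infty.$$

The bound does not depend on $x$, which is exactly the uniformity that C2 asks for; it relies only on the boundedness of the clean symbols.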

An $n$-block denoiser is a measurable mapping taking $\mathbb{R}^n$ into $[a,b]^n$. We assume a loss function $\Lambda : [a,b]^2 \to
[0, \infty)$ and denote the normalized cumulative loss of an $n$-block denoiser $\hat{X}^n$, when the underlying sequence is $x^n$
and the observed sequence is $y^n$, by

$$L_{\hat{X}^n}(x^n, y^n) = \frac{1}{n} \sum_{i=1}^{n} \Lambda\left(x_i, \hat{X}^n(y^n)[i]\right) \tag{2}$$



where $\hat{X}^n(y^n)[i]$ denotes the $i$-th component of $\hat{X}^n(y^n)$. In addition to the constraints on the channel, we impose
some conditions on the permissible loss functions $\Lambda$. We assume the loss function $\Lambda$

L1. to be bounded, i.e., $\Lambda_{\max} < \infty$, where $\Lambda_{\max} = \sup_{x, \hat{x} \in [a,b]} \Lambda(x, \hat{x})$;

L2. to be a bounded Lipschitz function. More formally, we require the Lipschitz norm $\|\Lambda\|_L < \infty$. The Lipschitz
norm of the loss function is defined as

$$\|\Lambda\|_L = \sup_{0 < \Delta \le (b-a)} \frac{\lambda(\Delta)}{\Delta} \tag{3}$$

where

$$\lambda(\Delta, x) = \sup_{y \in [a,b]} \ \sup_{x' : |x - x'| \le \Delta} \left|\Lambda(x, y) - \Lambda(x', y)\right| \tag{4}$$

and

$$\lambda(\Delta) = \sup_{x \in [a,b]} \lambda(\Delta, x) \tag{5}$$

In words, this condition necessitates continuity of the mapping that takes the estimates of the underlying
symbol to the corresponding loss incurred. We require that estimates of the underlying clean symbol that are
close together have corresponding loss values that are also close to each other.

It can be easily verified that the commonly used loss functions based on the $L_2$ and $L_1$ norms satisfy the aforementioned conditions.
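The normalized loss in (2) and the Lipschitz norm in (3)-(5) can be probed numerically. The sketch below is a hedged illustration only: the helper names are hypothetical, and the Lipschitz norm is approximated by the largest difference quotient over a finite grid on $[a,b]$, which gives a lower bound on the supremum in (3) rather than its exact value.

```python
def squared_loss(x, x_hat):
    return (x - x_hat) ** 2


def absolute_loss(x, x_hat):
    return abs(x - x_hat)


def normalized_loss(clean, estimates, loss):
    """Normalized cumulative loss (2) of a denoised sequence."""
    if len(clean) != len(estimates):
        raise ValueError("sequences must have equal length")
    return sum(loss(x, xh) for x, xh in zip(clean, estimates)) / len(clean)


def lipschitz_norm_on_grid(loss, a, b, m=30):
    """Grid approximation (a lower bound) of the Lipschitz norm (3)."""
    grid = [a + (b - a) * k / (m - 1) for k in range(m)]
    best = 0.0
    for y in grid:
        for i, x in enumerate(grid):
            for x2 in grid[i + 1:]:
                # difference quotient in the first argument, as in (4)
                quotient = abs(loss(x, y) - loss(x2, y)) / (x2 - x)
                best = max(best, quotient)
    return best
```

On $[0,1]$ the grid value for the absolute loss is 1 and the value for the squared loss approaches 2 from below as the grid is refined, consistent with both losses satisfying L1 and L2 on a bounded interval.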

Let $\mathcal{F}^{[a,b]}$ denote the set of all probability distribution functions with support contained in the interval $[a,b]$. For
$F \in \mathcal{F}^{[a,b]}$, we let

$$U(F) = \min_{\hat{x} \in [a,b]} \int \Lambda(x, \hat{x}) \, dF(x) \tag{6}$$

denote its 'Bayes envelope' (our assumptions on the loss function will imply existence of the minimum). In other
words, $U(F)$ denotes the minimum achievable expected loss when guessing the value of $X \sim F$. Define the
symbol-by-symbol minimum loss of $x^n$ by

$$D_0(x^n) = \min_{g} E\left[\frac{1}{n} \sum_{i=1}^{n} \Lambda\left(x_i, g(Y_i)\right)\right] \tag{7}$$



where the minimum is over all measurable maps $g : \mathbb{R} \to [a,b]$. $D_0(x^n)$ denotes the minimum expected loss in
denoising the sequence $x^n$ using a time-invariant symbol-by-symbol rule. This can be attained by a "genie" with
access to the clean sequence $x^n$. $D_0(x^n)$, which is the expected per-symbol loss of the optimal symbol-by-symbol
rule for the individual sequence $x^n$, will be our benchmark for assessing the performance of the universal
symbol-by-symbol denoiser that we construct in the next section. The same benchmark was also used in [5]. This is slightly
different from the benchmark used in [36], which corresponded to a genie that can choose the best symbol-by-symbol
rule with knowledge not only of the individual sequence $x^n$, but also of the noisy sequence realization $Y^n$. The
latter is irrelevant for our current setting, where each of the components of $Y^n$ will take on a different value with
probability one. For $x^n \in [a,b]^n$, define

$$F_{x^n}(x) = \frac{\left|\{1 \le i \le n : x_i \le x\}\right|}{n}, \tag{8}$$

i.e., the CDF associated with the empirical distribution of $x^n$. Note that $D_0(x^n)$ can be expressed as

$$D_0(x^n) = \min_{g} \int_{[a,b]} E_x \Lambda(x, g(Y)) \, dF_{x^n}(x) \tag{9}$$

where $E_x$ denotes expectation when the underlying clean symbol is $x$, the expectation being over the channel noise:

$$E_x \Lambda(x, g(Y)) = \int_{\mathbb{R}} \Lambda(x, g(y)) f_{Y|x}(y) \, dy \tag{10}$$

For $F \in \mathcal{F}^{[a,b]}$, let $F \otimes \mathcal{C}$ and $E_{F \otimes \mathcal{C}}$ denote, respectively, probability and expectation when the channel input
$X \sim F$ and $Y$ is the channel output, so that

$$E_{F \otimes \mathcal{C}} \Lambda(X, g(Y)) = \int_{[a,b]} E_x \Lambda(x, g(Y)) \, dF(x) = \int_{[a,b]} \left[\int_{\mathbb{R}} \Lambda(x, g(y)) f_{Y|x}(y) \, dy\right] dF(x) \tag{11}$$

Letting $[F \otimes \mathcal{C}]_{X|y}$ denote the conditional distribution of $X$ given $Y = y$ under $F \otimes \mathcal{C}$, we have

$$\min_{g} E_{F \otimes \mathcal{C}} \Lambda(X, g(Y)) = E_{F \otimes \mathcal{C}} \, U\left([F \otimes \mathcal{C}]_{X|Y}\right) \tag{12}$$

with $U$ denoting the Bayes envelope as defined above. Letting $g_{opt}[F]$ denote the achiever of the minimum in (12),
we note that it is given by the Bayes response to $[F \otimes \mathcal{C}]_{X|y}$, namely,

$$g_{opt}[F](y) = \arg\min_{\hat{x} \in [a,b]} \int \Lambda(x, \hat{x}) \, d[F \otimes \mathcal{C}]_{X|y}(x) = \arg\min_{\hat{x} \in [a,b]} \int_{[a,b]} \Lambda(x, \hat{x}) f_{Y|x}(y) \, dF(x) \tag{13}$$

In Lemma 12, we will establish the concavity of $U(F)$, and minimizing this bounded (by our assumption of bounded
$\Lambda$) concave function over a closed compact interval, $[a,b]$, guarantees the existence of the minimizer, $g_{opt}$. Note
that from (9), (10) and (11) we have

$$D_0(x^n) = \min_{g} E_{F_{x^n} \otimes \mathcal{C}} \Lambda(X, g(Y)) \tag{14}$$

where $F_{x^n}$ was defined in (8) and the minimum is attained by $g_{opt}[F_{x^n}]$. Thus, only a "genie" with access to the
empirical distribution of the noiseless sequence could employ $g_{opt}[F_{x^n}]$.
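To make the Bayes envelope in (6) concrete for an empirical distribution $F_{x^n}$, the following sketch evaluates it by brute force. It assumes the minimization over $\hat{x} \in [a,b]$ may be replaced by a search over a uniform grid of `m` candidate points; the function name and that discretization are illustrative choices, not part of the construction in the paper.

```python
def bayes_envelope(clean, loss, a, b, m=1001):
    """Grid approximation of U(F) in (6) for the empirical distribution of `clean`.

    Returns the approximate minimum expected loss and the grid point achieving it.
    """
    grid = [a + (b - a) * k / (m - 1) for k in range(m)]

    def expected_loss(x_hat):
        return sum(loss(x, x_hat) for x in clean) / len(clean)

    best = min(grid, key=expected_loss)
    return expected_loss(best), best
```

For the two-point sequence (0, 1) under squared loss, the minimizer is the mean 0.5 and the envelope is the variance 0.25; under absolute loss the minimizer is a median.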



III. Construction of Universal 'Symbol-by-symbol' Denoiser and Preliminaries 

$F_{x^n}$ and, hence, $g_{opt}[F_{x^n}]$ are not known to an observer of the noisy sequence. The first step towards constructing
an estimate of $g_{opt}[F_{x^n}]$ is to estimate the input empirical distribution from the observable noisy sequence, $Y^n$,
and knowledge of the channel, $\mathcal{C}$. We approach this problem by first estimating a function that tracks the evolution
of the 'average' density function according to which the noisy symbols are distributed. For an input sequence
$x^n$, given the memoryless nature of the channel, the output symbols will be independent with respective
distributions $\{F_{Y|x_1}, \ldots, F_{Y|x_n}\}$ and corresponding density functions $\{f_{Y|x_1}, \ldots, f_{Y|x_n}\}$. The function we are
interested in estimating is

$$f_Y(y) = \frac{1}{n} \sum_{i=1}^{n} f_{Y|x_i}(y) \tag{15}$$

which can be thought of as the marginal density, $f_Y$, of the noisy symbols in the semi-stochastic setting where $x^n$
is the unknown deterministic sequence. The estimation of this function is done by exploiting the vast literature on
density estimation techniques [7], [6], the details of which are discussed in Subsection III-A below. Once we have
an estimate $\hat{f}_Y^n = \hat{f}_Y^n[Y^n]$ for this function, we use it to estimate the input empirical distribution by

$$\hat{F}_{x^n}[Y^n] = \arg\min_{F \in \mathcal{F}_n^{[a,b]}} d\Big(\hat{f}_Y^n, \underbrace{\int f_{Y|x} \, dF(x)}_{[F \otimes \mathcal{C}]_Y}\Big) \tag{16}$$

where $\mathcal{F}_n^{[a,b]} \subseteq \mathcal{F}^{[a,b]}$ denotes the set of empirical distributions induced by $n$-tuples with $[a,b]$-valued components
and $[F \otimes \mathcal{C}]_Y$ denotes the marginal density induced at the output of the channel by an input distribution $F$. That
is, every member $F(x)$ of $\mathcal{F}_n^{[a,b]}$ is of the form

$$F(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\} \tag{17}$$

for some $n$-tuple $x^n = (x_1, x_2, \ldots, x_n)$ with $[a,b]$-valued components. The norm, $d$, in (16) is defined as

$$d(f, g) = \int \left|f(y) - g(y)\right| dy \tag{18}$$

The channel, $\mathcal{C}$, induces a set of 'feasible' densities of the output noisy symbol corresponding to the family of
empirical distributions of the underlying clean sequence at the input of the channel. The density estimate, $\hat{f}_Y^n$, which
is constructed only from the noisy sequence, $Y^n$, is oblivious to the set of achievable marginal densities and hence
could lie outside this set. It is thus natural to estimate the unobserved $F_{x^n}$ by the member of $\mathcal{F}_n^{[a,b]}$ leading to a
channel output distribution closest to the estimated one, $\hat{f}_Y^n$. This is exactly the estimate in (16). The uniqueness
of the minimizer in (16) follows from the fact that the objective function being minimized is a norm and
hence convex, coupled with the linear independence assumption on the channel, C4. Assumption C4 implies
a one-to-one correspondence between channel input and channel output distributions (i.e., "invertibility" of the
channel). Additionally, the search for the minimizer is conducted on a convex set of distribution functions, $\mathcal{F}_n^{[a,b]}$,
so that the minimizer, i.e., the candidate input empirical distribution estimate, is uniquely achieved.



A two-stage quantization of both the support of the underlying clean symbol, $[a,b]$, and the levels of the
estimate of its empirical distribution function, $\hat{F}_{x^n}$, is carried out to give the corresponding quantized probability
mass function, which has mass points only at the quantized symbols.

[Fig. 1. Quantization of the support of a distribution function $F \in \mathcal{F}^{[a,b]}$: the mass $P^{\Delta}(a_i) = F(a_i) - F(a_{i-1})$ is assigned to the quantization point $a_i$.]

Q1. The quantization of the interval $[a,b]$ is depicted in Fig. 1. For a given quantization step size, $\Delta$, the
quantized symbols, $a_i$, in the interval $[a,b]$ are constructed in the following manner.
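For $\Delta > 0$, let $N(\Delta) = \lceil (b-a)/\Delta \rceil$. If $(b-a)/\Delta = \lfloor (b-a)/\Delta \rfloor$, consider the family of vectors

$$\mathcal{F}^{\Delta} = \left\{P^{\Delta} : P^{\Delta} = \left(P(a_0), P(a_1), \ldots, P(a_{N(\Delta)})\right)\right\}, \quad \mathcal{A}^{\Delta} = \{a_i = a + i\Delta,\ i = 0, \ldots, N(\Delta)\}, \quad \text{s.t. } \sum_{i=0}^{N(\Delta)} P(a_i) = 1;$$

else, define the family of vectors as

$$\mathcal{F}^{\Delta} = \left\{P^{\Delta} : P^{\Delta} = \left(P(a_0), P(a_1), \ldots, P(a_{N(\Delta)-1}), P(a_{N(\Delta)})\right)\right\}, \quad \mathcal{A}^{\Delta} = \{a_i = a + i\Delta,\ i = 0, \ldots, N(\Delta) - 1\} \cup \{a_{N(\Delta)} = b\}, \quad \sum_{i=0}^{N(\Delta)} P(a_i) = 1.$$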

As indicated in Fig. 1, the probability mass function, $P^{\Delta}$, that we propose is constructed by allocating the
mass of the distribution function, $F$, in any quantization interval (of length $\Delta$) to the higher end point of that
interval. More precisely,

$$P^{\Delta}(a_i) = F(a_i) - F(a_{i-1}) \tag{19}$$

where the $a_i$'s are as defined above, and note that

$$P^{\Delta}(B) = \sum_{a_i \in B} P^{\Delta}(a_i)$$

for any $B \in \mathcal{B}^{[a,b]}$, where $\mathcal{B}^{[a,b]}$ is the Borel sigma-algebra generated by open sets in $[a,b]$.



Applying this quantization of the support of the underlying clean symbol to the estimate, $\hat{F}_{x^n}$, we now construct
the corresponding probability mass function, $\hat{P}^{\Delta}_{x^n}$:

$$\hat{P}^{\Delta}_{x^n}(a_i) = \hat{F}_{x^n}(a_i) - \hat{F}_{x^n}(a_{i-1}) \tag{20}$$

where $a_i \in \mathcal{A}^{\Delta}$.

Q2. The quantization of the values of $\hat{P}^{\Delta}_{x^n}$ is carried out using a uniform quantizer, $Q_{\delta}$:

$$\hat{P}^{\delta,\Delta}_{x^n} = Q_{\delta}\left(\hat{P}^{\Delta}_{x^n}\right) \tag{21}$$

where $\delta$ denotes the quantization step size on the interval $[0,1]$.

This is primarily motivated by tractability of the proof of the asymptotic optimality results. But it can also be
argued that any practical implementation of the proposed denoiser only has a finite precision representation of the
underlying clean symbol and of the distribution function values themselves. Analysis of the asymptotic optimality results
also lends itself nicely to viewing the distribution of the underlying clean symbol, $F_{x^n}$, as the asymptotic limit
attained by its quantized, finite precision representation, $P^{\Delta}_{x^n}$. This is formalized in Section III-C, where we discuss
the precise notion of convergence of $P^{\Delta}_{x^n}$ to the un-quantized probability measure.
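The two quantization stages Q1 and Q2 can be sketched in a few lines of Python. This is an illustrative reading only: the helper names are hypothetical, and the uniform quantizer is assumed to round each mass to the nearest multiple of $\delta$, a detail the text leaves open.

```python
import bisect
import math


def quantization_grid(a, b, step):
    """Points a, a + step, ..., with the last point forced to b (stage Q1)."""
    # The small slack guards against floating-point overshoot of an integer ratio.
    n_steps = math.ceil((b - a) / step - 1e-12)
    points = [a + i * step for i in range(n_steps)]
    points.append(b)
    return points


def empirical_cdf(values):
    """CDF of the empirical distribution of `values`, as in (8)."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def pmf_from_cdf(cdf, grid):
    """(19): the mass of each interval (a_{i-1}, a_i] is moved to a_i."""
    pmf = [cdf(grid[0])]
    for lower, upper in zip(grid, grid[1:]):
        pmf.append(cdf(upper) - cdf(lower))
    return pmf


def uniform_quantizer(pmf, delta):
    """(21), assuming rounding to the nearest multiple of delta."""
    return [delta * round(p / delta) for p in pmf]
```

After the second stage the masses need not sum to one (for example, masses 0.5, 0.25, 0.25 with $\delta = 0.1$ may round to 0.5, 0.2, 0.2), which is why the Bayes response is later extended to vectors that are not valid probability mass functions.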

The minimizer of the Bayes envelope in (13) is then constructed from the quantized probability mass function,
$\hat{P}^{\delta,\Delta}_{x^n}$, as $g_{opt}\left[\hat{P}^{\delta,\Delta}_{x^n}\right]$, where $g_{opt}$ for the quantized clean symbol is

$$g_{opt}[P](y) = \arg\min_{\hat{x} \in [a,b]} \sum_{a \in \mathcal{A}^{\Delta}} \Lambda(a, \hat{x}) \cdot f_{Y|x=a}(y) \cdot P(X = a) \tag{22}$$

$\mathcal{A}^{\Delta}$ is the finite alphabet approximation of $[a,b]$ corresponding to the quantization step size $\Delta$. Note that we have
extended the definition of $g_{opt}$ to accommodate the case when $P$ is not a valid probability mass function, e.g., $\hat{P}^{\delta,\Delta}_{x^n}$
(it does not necessarily sum up to 1). Equipped with $\hat{P}^{\delta,\Delta}_{x^n}$, the candidate for the $n$-block symbol-by-symbol denoiser is now given by

$$\hat{X}^{n,\delta,\Delta}[y^n](i) = g_{opt}\left[\hat{P}^{\delta,\Delta}_{x^n}[y^n]\right](y_i), \quad 1 \le i \le n \tag{23}$$

where $g_{opt}$ is given in (22). We now proceed to discuss in detail the construction and consistency results of the
estimates $\hat{f}_Y^n$, $\hat{F}_{x^n}$ and its quantized version, $\hat{P}^{\delta,\Delta}_{x^n}$.
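The rule in (22)-(23) can be sketched as follows. The sketch assumes a Gaussian channel density and searches the reconstruction value over a finite list of candidates instead of the whole interval $[a,b]$; both choices, like the function names, are assumptions made for illustration. The weights are deliberately left unnormalized, so the rule also applies when the quantized masses do not sum to one.

```python
import math


def gaussian_channel_density(y, x, sigma):
    """f_{Y|x}(y) for an additive Gaussian channel (assumed for this sketch)."""
    return math.exp(-0.5 * ((y - x) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def bayes_response(y, grid, pmf, loss, channel_density, candidates):
    """(22): minimize the likelihood-weighted loss over candidate reconstructions."""
    weights = [channel_density(y, a) * p for a, p in zip(grid, pmf)]

    def weighted_loss(x_hat):
        return sum(w * loss(a, x_hat) for a, w in zip(grid, weights))

    return min(candidates, key=weighted_loss)


def denoise(noisy, grid, pmf, loss, channel_density, candidates):
    """(23): apply the same symbol-by-symbol rule to every noisy sample."""
    return [bayes_response(y, grid, pmf, loss, channel_density, candidates)
            for y in noisy]
```

For squared loss this reduces to (a grid approximation of) the posterior mean on the quantized alphabet, so an observation midway between two equally likely clean levels is mapped to their midpoint.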

A. Density Estimation for independent and non-identically distributed random variables

We now obtain an estimator, $\hat{f}_Y^n$, for the function in (15), which depends on $x^n$ and is therefore unknown to
the denoiser. Given the memoryless nature of the channel, the output symbols $Y_1, Y_2, \ldots, Y_n$ are
independent random variables taking values in $\mathbb{R}$, having conditional densities $f_{Y|x_1}, f_{Y|x_2}, \ldots, f_{Y|x_n}$, respectively.
A density estimate is a sequence $f_1, f_2, \ldots, f_n$, where for each $n$, $\hat{f}_Y^n(y) = f_n(y; Y_1, \ldots, Y_n)$ is a real-valued
Borel measurable function of its arguments and, for fixed $n$, $f_n$ is a density estimate on $\mathbb{R}$. The kernel density
estimate is given by

$$\hat{f}_Y^n(y) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{y - Y_i}{h}\right) \tag{24}$$

where $h = h_n$ is a sequence of positive numbers and $K$ is a Borel measurable function satisfying $K \ge 0$, $\int K = 1$.
The $L_1$ distance, $J_n$, is defined as

$$J_n = \int \left| \hat{f}_Y^n(y) - \frac{1}{n} \sum_{i=1}^{n} f_{Y|x_i}(y) \right| dy \tag{25}$$

The choice of the $L_1$ distance, as elaborated by the authors in [7], is motivated by its invariance under monotone
transformations of the coordinate axes and the fact that it is always well-defined. Before proceeding to discuss
convergence results for $J_n$, we present definitions of certain types of kernel functions, $K$, that are the backbone of
kernel density estimation techniques [6].
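The quantities in (24) and (25) can be computed directly in a small experiment. The sketch below assumes a Gaussian kernel and an additive Gaussian channel, and approximates the integral in (25) by a midpoint Riemann sum on a finite interval; these choices and the function names are illustrative assumptions.

```python
import math
import random


def gaussian_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_density_estimate(y, noisy, h):
    """Kernel estimate (24) with a Gaussian kernel K."""
    return sum(gaussian_pdf((y - s) / h) for s in noisy) / (len(noisy) * h)


def average_channel_density(y, clean, sigma):
    """Target (15): average of the channel densities f_{Y|x_i}(y)."""
    return sum(gaussian_pdf((y - x) / sigma) for x in clean) / (len(clean) * sigma)


def l1_error(noisy, clean, sigma, h, lo, hi, m=400):
    """Midpoint-rule approximation of J_n in (25) on the interval [lo, hi]."""
    step = (hi - lo) / m
    total = 0.0
    for k in range(m):
        y = lo + (k + 0.5) * step
        total += abs(kernel_density_estimate(y, noisy, h)
                     - average_channel_density(y, clean, sigma)) * step
    return total
```

A usage example in the semi-stochastic setting, with a deterministic clean sequence alternating between 0 and 1:

```python
random.seed(0)
clean = [0.0 if i % 2 == 0 else 1.0 for i in range(300)]
noisy = [x + random.gauss(0.0, 0.3) for x in clean]
print(l1_error(noisy, clean, sigma=0.3, h=0.15, lo=-2.0, hi=3.0))
```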

Definition 1: The class of kernels, $\mathcal{K}$, s.t. $\forall K \in \mathcal{K}$ we have

$$\int K = 1$$

and $K$ is symmetric about 0, are called class 0 kernels.

Definition 2: A class $s$ kernel is a class 0 kernel for which

$$\int |x|^s \left|K(x)\right| dx < \infty$$

and

$$\int x^i K(x) \, dx = 0$$

for all $i = 1, \ldots, s-1$. Most class 0 kernels are in fact class 2 kernels, the only additional condition being that
$\int |x|^2 K(x) \, dx < \infty$. However, nonnegative class 0 kernels cannot possibly be of class $s \ge 3$.

Theorem 1: Let $K$ be a nonnegative Borel measurable function on $\mathbb{R}$ with $\int K = 1$ of class $s = 2$. Let $\hat{f}_Y^n$ be
the kernel estimate in (24) and $J_n$ the corresponding error as defined in (25). Consider

1) $J_n \to 0$ in probability as $n \to \infty$, for some sequence $x = (x_1, x_2, \ldots)$

2) $J_n \to 0$ in probability as $n \to \infty$, for all sequences $x = (x_1, x_2, \ldots)$

3) $J_n \to 0$ almost surely as $n \to \infty$, for all sequences $x = (x_1, x_2, \ldots)$

4) For all $\epsilon > 0$, there exist $r, n_0 > 0$ such that $P(J_n \ge \epsilon) \le e^{-rn}$, $n \ge n_0$, for all sequences $x$.

5) $\lim_{n \to \infty} h = 0$, $\lim_{n \to \infty} nh = \infty$

Then 5) $\Rightarrow$ 4) $\Rightarrow$ 3) $\Rightarrow$ 2) $\Rightarrow$ 1).

The following lemma is key to the proof of Theorem 1.

Lemma 1: For any family of channel probability density functions
$\{f_{Y|x}\}_{x \in [a,b]}$ on $\mathbb{R}^d$ satisfying assumptions C1-C7, and any non-negative, integrable function $K$ with $\int K(x) \, dx = 1$,
condition 4) in Theorem 1 holds whenever

$$\lim_{n \to \infty} h_n = 0 \quad \text{and} \quad \lim_{n \to \infty} n h_n^d = \infty \tag{26}$$

Proof: [Proof of Theorem 1]
The implication 5) $\Rightarrow$ 4) is proved in Lemma 1. Since, clearly, 4) $\Rightarrow$ 3) $\Rightarrow$ 2) $\Rightarrow$ 1), the proof of Theorem 1 is
complete. ■

B. Channel Inversion 

The mapping in (16) projects the kernel density estimate of $\frac{1}{n} \sum_{i=1}^{n} f_{Y|x_i}(y)$ to an estimate of the empirical
distribution, $\hat{F}_{x^n}$. This projection is such that it best approximates (in the $L_1$ sense) the kernel density estimate
with a member of the set of achievable channel output distributions. From the construction of $\hat{f}_Y^n$ in (24), it is
clear that $\hat{f}_Y^n$ is a bona fide density on $\mathbb{R}$. Additionally, from the construction of $\hat{F}_{x^n}$ in (16), we see that for every
$F \in \mathcal{F}_n^{[a,b]}$, $[F \otimes \mathcal{C}]_Y$ is also a valid density on $\mathbb{R}$. Finally, from the definition of the norm, $d$, in (18), it is true that
for $\hat{f}_Y^n$ and $[F \otimes \mathcal{C}]_Y$ being bona fide densities on $\mathbb{R}$, $0 \le d\left(\hat{f}_Y^n, [F \otimes \mathcal{C}]_Y\right) \le 2$ for all $n$. These facts, together with
the convexity of $\mathcal{F}_n^{[a,b]}$, show that the estimator in (16) is well defined. With the Levy metric defined as:

Definition 3 (Levy metric): The Levy distance $\lambda(F, G)$ between any two distributions $F$ and $G$ is defined as

$$\lambda(F, G) = \inf\{\epsilon > 0 : F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon \ \text{for all } x\}$$

we have:

Theorem 2: For the estimator $\hat{F}_{x^n}$ defined in equation (16), we have $\lambda\left(\hat{F}_{x^n}, F_{x^n}\right) \to 0$ a.s. for all $x \in [a,b]^{\infty}$.
The proof of Theorem 2 is discussed in detail in Appendix III.
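The Levy distance of Definition 3 can be approximated numerically. The sketch below is a rough illustration under two assumptions: the defining inequalities are only checked on a finite grid of points $x$, and the infimum is located by bisection, which is valid because the set of admissible $\epsilon$ is upward closed and always contains $\epsilon = 1$. The function names are hypothetical.

```python
import bisect


def step_cdf(points):
    """Empirical (step) CDF of a finite list of points."""
    ordered = sorted(points)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def levy_distance(F, G, lo, hi, m=3001, tol=1e-4):
    """Grid-based approximation of the Levy distance between CDFs F and G."""
    xs = [lo + (hi - lo) * k / (m - 1) for k in range(m)]

    def admissible(eps):
        return all(F(x - eps) - eps <= G(x) <= F(x + eps) + eps for x in xs)

    left, right = 0.0, 1.0  # eps = 1 is always admissible for CDFs
    while right - left > tol:
        mid = 0.5 * (left + right)
        if admissible(mid):
            right = mid
        else:
            left = mid
    return right
```

For two point masses at 0 and at 0.3, the distance is 0.3 (up to the grid resolution), reflecting that the Levy metric measures a horizontal shift of the distribution function.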

C. Distribution-independent Approximation of the Estimate of the Input empirical distribution 

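In this section, we discuss the notion of convergence of $P^{\Delta}_{x^n}$ to the law corresponding to the un-quantized distribution
function $F_{x^n}$.

Definition 4 ($\beta$ metric): For any two laws $P$ and $Q$ on $S$ and $f : S \to \mathbb{R}$, let $\int f \, d(P - Q) := \int f \, dP - \int f \, dQ$. For
bounded $\int f \, dP$ and $\int f \, dQ$, the Prohorov metric is defined as

$$\beta(P, Q) = \sup_{\|f\|_{BL} \le 1} \left| \int f \, d(P - Q) \right|$$

where

$$\|f\|_{BL} = \|f\|_L + \|f\|_{\infty} \tag{27}$$

and

$$\|f\|_L = \sup_{x \ne y} \frac{\left|f(x) - f(y)\right|}{|x - y|}, \qquad \|f\|_{\infty} = \sup_{x} \left|f(x)\right| \tag{28}$$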

Equipped with this definition, we now state the following theorem.

Theorem 3:

$$\lim_{\Delta \to 0} \beta\left(P_{x^n}, P^{\Delta}_{x^n}\right) = 0 \tag{29}$$

where $P_{x^n}$ denotes the law associated with the distribution function $F_{x^n}$.

Proof: Follows directly from Lemma 2. ■
Lemma 2: For any $F \in \mathcal{F}^{[a,b]}$,

$$\lim_{\Delta \to 0} \beta\left(P, P^{\Delta}\right) = 0 \tag{30}$$

where $P$ is the law associated with the distribution function $F$ in the family $\mathcal{F}^{[a,b]}$. In particular, the $F$ and $P^{\Delta}$ that
satisfy (30) are related by

$$P^{\Delta}(a_i) = F(a_i) - F(a_{i-1}) \tag{31}$$

where $a_i \in \mathcal{A}^{\Delta}$ and $\mathcal{A}^{\Delta}$ is the finite alphabet approximation of $[a,b]$ discussed earlier.

In words, any empirical distribution of the underlying clean sequence is approximated arbitrarily well by a PMF
on the quantized set of points when the quantization is fine enough.
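The intuition behind Lemma 2 can be seen from a short calculation, given here only as a hedged sketch and not as the proof in the appendix. Since $P^{\Delta}$ moves the mass of each interval $(a_{i-1}, a_i]$ to its upper end point $a_i$, and each such interval has length at most $\Delta$, any $f$ with $\|f\|_{BL} \le 1$ (and hence $\|f\|_L \le 1$) satisfies

$$\left| \int f \, dP - \int f \, dP^{\Delta} \right| = \left| \sum_{i} \int_{(a_{i-1}, a_i]} \left(f(x) - f(a_i)\right) dP(x) \right| \le \sum_{i} \int_{(a_{i-1}, a_i]} \|f\|_L \, |x - a_i| \, dP(x) \le \Delta,$$

where the mass that $P$ places at the point $a_0 = a$ is unchanged by the quantization and contributes nothing. Taking the supremum over such $f$ gives $\beta(P, P^{\Delta}) \le \Delta$, which vanishes as $\Delta \to 0$.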

Next we discuss the mechanics of the construction of the denoiser, which has the density estimation and the 
channel inversion steps as its core. 

D. Implementation of the symbol-by-symbol denoiser 

The implementation of the denoiser in the previous section involves a discretization of the density estimation
and the channel inversion steps. The discretized version of the kernel density estimate, $\hat{f}_Y^n(y)$, in (24) is evaluated
at a set of discrete points, $\{y_1, \ldots, y_N\}$. This gives an $N$-dimensional vector of values, $\hat{P}_Y(y_i)$.
The "channel inversion" in (16) is also discretized using the estimate $\hat{P}_Y(y_i)$.

1) Fast kernel density estimation: The kernel density estimation in (24) for a given kernel function, $K$, although
simple in construction, carries a significant computational burden: a brute-force computation costs $O(Nn)$
for $n$ data points and $N$ points $\{y_1, \ldots, y_N\}$ at which $\hat{P}_Y(y)$ is evaluated. The computational
complexity can be greatly reduced by using FFT-based methods [31]. Recently, there has been extensive work on
the use of fast Gauss transform-based techniques [16] for reducing the computational complexity. These techniques
reduce the complexity from $O(Nn)$ to $O(N+n)$. The cardinal factor in nonparametric density estimation procedures
is the choice of the optimal bandwidth, $h$, in (24). There has been some recent work in [14] on using dual-tree
methods to derive fast methods for optimal bandwidth choice that maintain the complexity of this step
at $O(N+n)$. For $N = O(n)$, this reduces to $O(n)$.



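2) Channel inversion using linear programming techniques: In solving the channel inversion problem in (16),
we are looking for a vector in the probability simplex, $\mathcal{F}^{\Delta} = \{P : \sum_{i} P(a_i) = 1,\ P(a_i) \ge 0,\ a_i \in \mathcal{A}^{\Delta}\}$, for our candidate
distribution function, $\hat{P}^{\delta,\Delta}_{x^n}$. The discretized version of (16) is given by

$$\hat{P}^{\delta,\Delta}_{x^n} = \arg\min_{p \in \mathcal{F}^{\Delta}} \sum_{i=1}^{N} \left| \hat{P}_Y(y_i) - \sum_{j=1}^{N(\Delta)} f_{Y|x=x_j}(y_i) \, Q_{\delta}\left(p(x_j)\right) \right| \tag{32}$$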



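The objective function, being an $L_1$ norm, is clearly a convex function (of the input distribution, $p(\cdot)$), and the
candidate minimizer also resides in a convex set, viz., the probability simplex $\mathcal{F}^{\Delta}$. The problem can be
solved using well-studied linear programming algorithms from the broader area of convex optimization.
The particular reformulation of the problem that is solved is of the form

$$\begin{aligned}
\hat{P}^{\delta,\Delta}_{x^n} = \arg\min \ & \sum_{i=1}^{N} \epsilon_i \\
\text{s.t.} \ & \hat{P}_Y(y_i) - \sum_{j=1}^{N(\Delta)} f_{Y|x=x_j}(y_i) \, Q_{\delta}\left(p(x_j)\right) \le \epsilon_i, \\
& \sum_{j=1}^{N(\Delta)} f_{Y|x=x_j}(y_i) \, Q_{\delta}\left(p(x_j)\right) - \hat{P}_Y(y_i) \le \epsilon_i, \quad \forall i \in \{1, \ldots, N\}
\end{aligned} \tag{33}$$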

The computational complexity of solving this problem using the popular interior point methods [2] is
$O((N + N(\Delta))^3) = O((N + \frac{1}{\Delta})^3) = O((N + \log n)^3)$. This again, for $N = O(n)$, reduces to $O((n + \log n)^3) = O(n^3)$.
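As an illustration of the reformulation in (33), the sketch below assembles the linear program in the standard form "minimize $c^T z$ subject to $A_{ub} z \le b_{ub}$, $A_{eq} z = b_{eq}$, $z \ge 0$". It is a sketch under stated assumptions rather than the authors' code: the function name `build_channel_inversion_lp` is hypothetical, the quantizer $Q_\delta$ is omitted (quantization would be applied to the solution afterwards), and the simplex constraint is written as a single equality row.

```python
def build_channel_inversion_lp(p_y, channel):
    """Assemble LP data for  min sum_i eps_i  under the two-sided constraints.

    p_y     : list of N density-estimate values P_Y(y_i)
    channel : N x M nested list, channel[i][j] = f_{Y|x=x_j}(y_i)
    Decision vector: z = (p_1, ..., p_M, eps_1, ..., eps_N), all >= 0.
    Returns (c, A_ub, b_ub, A_eq, b_eq).
    """
    N, M = len(p_y), len(channel[0])
    c = [0.0] * M + [1.0] * N          # objective: sum of the slacks eps_i
    A_ub, b_ub = [], []
    for i in range(N):
        slack = [0.0] * N
        slack[i] = -1.0
        # P_Y(y_i) - sum_j f_ij p_j <= eps_i
        A_ub.append([-f for f in channel[i]] + slack)
        b_ub.append(-p_y[i])
        # sum_j f_ij p_j - P_Y(y_i) <= eps_i
        A_ub.append(list(channel[i]) + slack)
        b_ub.append(p_y[i])
    A_eq = [[1.0] * M + [0.0] * N]     # probabilities sum to one
    b_eq = [1.0]
    return c, A_ub, b_ub, A_eq, b_eq
```

The returned arrays can be handed to any LP solver that accepts this standard form; for instance, SciPy's `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)` uses nonnegative variable bounds by default.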
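The two-pronged quantization discussed in the previous section can be naturally built into the optimization
problem in (32) by searching in

\[
\mathcal{F}^{\delta,\Delta} = \{Q_\delta(P) : P \in \mathcal{F}^{\Delta}\} \tag{34}
\]

the set of $N(\Delta)$-tuples with components in $[0,1]$ that are integer multiples of $\delta$ with point masses on the set $\mathcal{A}^{\Delta}$.
The formulation would then be

\[
\begin{aligned}
\hat{P}^{\delta,\Delta}_{x^n} = \arg\min_{p \in \mathcal{F}^{\delta,\Delta}}\ & \sum_{i=1}^{N} \epsilon_i \\
\text{s.t.}\quad & \hat{P}_Y(y_i) - \sum_{j=1}^{N(\Delta)} f_{Y|x=x_j}(y_i)\, p(x_j) \le \epsilon_i \\
& \sum_{j=1}^{N(\Delta)} f_{Y|x=x_j}(y_i)\, p(x_j) - \hat{P}_Y(y_i) \le \epsilon_i \qquad \forall i \in \{1, \cdots, N\}
\end{aligned}
\]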
This channel inversion is at the heart of the denoiser in (22) and its simple formulation makes the scheme particularly
elegant and practically implementable. The estimate of the empirical distribution in (32) is then plugged into (22)
to finally give an estimate of the underlying clean symbol according to (23). The denoiser is described as Algorithm
1 below.
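The second pass can be sketched in a few lines of Python. This is an illustrative reading of the Bayes-response step, assuming the reconstruction alphabet is the finite quantized support and that the denoiser picks the reconstruction minimizing $\sum_a \Lambda(a, \hat{x})\, f_{Y|a}(y)\, \hat{P}(a)$. The helper names (`bayes_response`, `denoise`, `gaussian_channel`) are hypothetical, and the Gaussian channel is only an example.

```python
import math

def gaussian_channel(y, x, sigma):
    """f_{Y|x}(y) for an additive Gaussian channel (illustrative choice)."""
    return math.exp(-0.5 * ((y - x) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_response(y, support, p_hat, channel_pdf, loss):
    """Pick x_hat in `support` minimizing sum_a loss(a, x_hat) * f_{Y|a}(y) * p_hat(a)."""
    weights = [channel_pdf(y, a) * p for a, p in zip(support, p_hat)]
    return min(support,
               key=lambda x_hat: sum(w * loss(a, x_hat)
                                     for a, w in zip(support, weights)))

def denoise(noisy, support, p_hat, channel_pdf, loss):
    """Second pass: apply the same symbol-by-symbol rule at every location."""
    return [bayes_response(y, support, p_hat, channel_pdf, loss) for y in noisy]
```

For example, with `support = [0.0, 1.0]`, a uniform `p_hat`, squared-error loss and `sigma = 0.5`, an observation near 0 is mapped to 0.0 and one near 1 to 1.0; a strongly skewed `p_hat` pulls ambiguous observations toward the more probable symbol.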

IV. Performance guarantees for the Symbol by Symbol denoiser 

The main result of this section is Theorem 5 below, which establishes the universal asymptotic optimality of
our proposed symbol-by-symbol denoiser in (23) with respect to the class of symbol-by-symbol schemes. The
predominant technical result leading to Theorem 5 is Theorem 4. We continue to restrict ourselves to the
semi-stochastic setting where the underlying clean sequence is an unknown, but deterministic, sequence $\mathbf{x}$. The benchmark
performance for the clean sequence is the minimum possible symbol-by-symbol loss, $D_0(x^n)$, defined in Section II.
Theorem 5 shows that our proposed denoiser, $g_{\mathrm{opt}}[\hat{P}^{\delta,\Delta}_{x^n}]$, asymptotically (as the number of observations increases)
achieves that benchmark performance. This is achieved by bounding, in Theorem 4, the deviation of the cumulative loss incurred
by $g_{\mathrm{opt}}[\hat{P}^{\delta,\Delta}_{x^n}]$ from the minimum possible symbol-by-symbol loss for any block length, $n$. Hence
we show that $g_{\mathrm{opt}}[\hat{P}^{\delta,\Delta}_{x^n}]$ performs essentially as well as the best possible symbol-by-symbol denoiser, whose loss is $D_0(x^n)$.

input : Noisy sequence $y^n$, channel $\mathcal{C}$
output: Denoised sequence, $\hat{x}^n$

FIRST PASS
  Density estimation step
    input : Noisy sequence, $y^n$
    output: Density estimate, $\hat{f}_Y$
    Determine the optimal bandwidth from any one of the techniques discussed in [31], e.g., cross-validation
    Use techniques discussed in [14] for fast evaluation of (24)
  Channel inversion step
    input : $\hat{f}_Y$, quantization resolutions, $\delta$, $\Delta$
    output: $\hat{P}^{\delta,\Delta}_{x^n}$
    Construct an LP (Linear Program) as in (33) and use linprog (in MATLAB) or any convex program
      solver to solve it. Alternatively, use log-barrier methods discussed in [3] to solve for the estimate, $\hat{F}_{x^n}$
    Use the quantization mapping in (20) to map $\hat{F}_{x^n}$ to $\hat{P}^{\Delta}_{x^n}$
    Then use a uniform quantizer with resolution $\delta$ to get $\hat{P}^{\delta,\Delta}_{x^n} \leftarrow Q_\delta(\hat{P}^{\Delta}_{x^n})$
SECOND PASS
    input : Noisy sequence, $y^n$, channel $\mathcal{C}$, estimate of input distribution $\hat{P}^{\delta,\Delta}_{x^n}$
    output: Denoised sequence, $\hat{x}^n$
    Use equations (22), (23) to denoise at every location, $i$:
    for $i \leftarrow 1$ to $n$ do
      $\hat{x}_i \leftarrow g_{\mathrm{opt}}[\hat{P}^{\delta,\Delta}_{x^n}](y_i)$
    end

Algorithm 1: Symbol-by-symbol denoiser in Section III
In preparation for Theorem 4, let $\mathcal{F}^{\delta,\Delta}$, defined in (34), denote the set of probabilities with components in $[0,1]$
that are integer multiples of $\delta$ (defined under Q2 in Section III). Note that $\hat{P}^{\delta,\Delta}_{x^n} \in \mathcal{F}^{\delta,\Delta}$, where $\hat{P}^{\delta,\Delta}_{x^n}$ was defined
in (21). Also, let $\mathcal{G}_{\delta,\Delta} = \{g_{\mathrm{opt}}[P]\}_{P \in \mathcal{F}^{\delta,\Delta}}$ denote the set of all possible denoisers that can be constructed from the



members of the set T SA using (|22). Define G(e, B) = §^ 



a„ (e,*,A,p,7) 



1 lA 



2e -G(e+5A max ,A max )n + e _(l-p)SX. 



e -(l-p)«x_ (35) 



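\[
\nu(\epsilon, \delta, \Delta, \Lambda, \mathcal{C}) = 3\epsilon + 5\delta\Lambda_{\max} + 4\xi_\Delta \Lambda_{\max} + 4\lambda(\Delta)(1 + \xi_\Delta) \tag{36}
\]

where

\[
\xi_\Delta = \sup_{c \in [a,b]}\ \sup_{|x - c| \le \Delta} \int \left| f_{Y|x}(y) - f_{Y|c}(y) \right| dy \tag{38}
\]

and $\lambda(\Delta)$ is the modulus of continuity defined earlier. The Lipschitz norm, $\|\Xi\|_L$, of $\xi_\Delta$ is given by

\[
\|\Xi\|_L = \sup_{0 < \Delta < (b-a)} \frac{\xi_\Delta}{\Delta} \tag{39}
\]

$D_0(x^n)$ is the symbol-by-symbol minimum loss of $x^n$ defined earlier.

Theorem 4: For all $\epsilon > 0$, $\delta > 0$, $\rho = \rho(\epsilon, \delta)$, $\Delta > 0$ and $x^n \in [a,b]^n$, let

\[
\gamma = \frac{\epsilon}{\|\Lambda\|_L + \Lambda_{\max}\|\Xi\|_L + (b-a)\|\Lambda\|_L \|\Xi\|_L + \Lambda_{\max}}
\]

then we have

\[
\Pr\left( \left| L_{\hat{X}^{n,\delta,\Delta}}(x^n, Y^n) - D_0(x^n) \right| > \nu(\epsilon, \delta, \Delta, \Lambda, \mathcal{C}) \right) \le \alpha_n(\epsilon, \delta, \Delta, \rho, \gamma) \quad \forall n \text{ s.t. } n h_n > n_0(\mathcal{C}, \rho, \delta, K) \tag{40}
\]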



where $\|\Xi\|_L$ is defined in (39) and the form of $n_0$ is given in (112). Note that the tightness condition on the probability
measures associated with the family of conditional densities of the channel, $\mathcal{C}$, guarantees that $n_0(\mathcal{C}, \rho, \delta, K) < \infty$,
$\forall \rho \in (0, 1)$. Theorem 4 formalizes the fact that the probability of deviation of the cumulative symbol-by-symbol
loss, $L_{\hat{X}^{n,\delta,\Delta}}(x^n, Y^n)$, from the minimum possible loss, $D_0(x^n)$, is exponentially small in the block length $n$.

Intuition behind the proof of Theorem 4

The benchmark for assessing the performance of the proposed denoiser is the minimum possible symbol-by-symbol
cumulative loss, $D_0(x^n)$. It has been shown in (14) that this is the minimum over all measurable mappings,
$g : \mathbb{R} \to [a, b]$, of the expected loss under the marginal density induced by the true distribution of the underlying clean
sequence. This has been further shown in (12) to be equal to the expected value of the Bayes envelope under the true
conditional empirical distribution of the underlying clean signal given the noisy observation. This true conditional
empirical distribution is the quantity that is unknown to us. However, if we have an
estimate of it that is in some sense "close" to the truth, and asymptotically coincides with it,
we are on the right track. Since the conditional distribution is derived as a function of the
marginal empirical distribution of the underlying clean signal, all that is needed is "closeness" of the estimate of
the marginal distribution of the underlying clean signal to the true marginal empirical distribution. The almost sure
convergence of the marginal density estimate at the output of the memoryless channel gives us, through the mapping in (16),
an estimate of the input empirical distribution that weakly converges, as shown in Theorem 2, to the true empirical
distribution of the underlying clean signal. This in turn yields the convergence of the expected
loss under the corresponding induced densities at the output of the memoryless channel. From (12) and (11), and the
fact that we have well-behaved channel conditional densities, $\{f_{Y|x}\}_{x \in [a,b]}$ (satisfying conditions C1-C7), and loss
function, $\Lambda$ (satisfying conditions L1-L2), we can bound the deviation of the expected value of $U([F \otimes \mathcal{C}]_{X|Y})$
under the two corresponding induced densities.

The goal, eventually, is to bound the deviation of the cumulative loss, $L_{\hat{X}^{n,\delta,\Delta}}$, incurred by the proposed denoiser
in (23) from $D_0(x^n)$ as a function of the block length, $n$. This is done by using Lemmas 5 and 6, which formalize the
deviation bounds of the expected loss under densities induced by weakly converging distributions. Finally, Lemma
7 is used to bound the deviation of the empirical expected loss from the true expected loss. These Lemmas are
analogous (in spirit) to the corresponding ones, i.e., Lemmas 1, 2, 3 (for context length $k = 0$), in the discrete-input,
general-valued output setting in [5]. There are, however, subtle differences in the bounds and in the requirements on
the channel and loss functions (C1-C7, L1-L2) that make the analysis possible in this continuous-valued setting. The combination
of these results is used to bound the deviation of $L_{\hat{X}^{n,\delta,\Delta}}$ from $D_0(x^n)$ in the proof of Theorem 4. Take now
$\delta = \delta_n$, $\Delta = \Delta_n$ such that $\delta_n \downarrow 0$, $\Delta_n \downarrow 0$ and, for all $\epsilon > 0$,

\[
\sum_{n=1}^{\infty} \alpha_n(\epsilon, \delta_n, \Delta_n, \rho, \gamma) < \infty \tag{41}
\]

For example, S n , A n = j-^- would satisfy the above requirements of summability and growth for any e > 0. With 
the growth rates that satisfy the summability condition in (41) for $\alpha_n(\epsilon, \delta_n, \Delta_n, \rho, \gamma)$ let,

\[
\hat{X}^{n}_{\mathrm{ssuniv}} = \hat{X}^{n,\delta_n,\Delta_n} \tag{42}
\]

where the subscript 'ssuniv' stands for symbol-by-symbol universal denoiser. A direct consequence of Theorem 4
and the Borel-Cantelli lemma gives us the following main theorem that establishes universal asymptotic optimality
of our proposed symbol-by-symbol denoiser for any unknown individual underlying clean sequence, $\mathbf{x}$.

Theorem 5: For all $\mathbf{x} \in [a,b]^{\infty}$,

\[
\lim_{n \to \infty} \left[ L_{\hat{X}^{n}_{\mathrm{ssuniv}}}(x^n, Y^n) - D_0(x^n) \right] = 0 \quad \text{a.s.} \tag{43}
\]



V. Construction of the Universal Denoiser and its performance guarantees 

In this section, we propose an extension of the symbol-by-symbol denoiser discussed in previous sections to a
$2k+1$-length sliding window denoising scheme, one that competes with sliding window schemes. The performance
guarantees made in the symbol-by-symbol case also hold in the proposed extension. The first result of this section
is presented in Theorem 6, which assesses the performance of our proposed scheme by showing that it does well
relative to the best sliding window scheme of order $2k+1$, as would be chosen by a "genie" that knows the
underlying clean sequence $x^n$. The main result of this section is Theorem 7, which establishes the strong universality
of our proposed sliding window denoiser, showing that it does essentially as well as any sliding window scheme, of
any order, as the length of the data increases, regardless of what the underlying clean sequence may be. Theorem 7
will be shown to be a direct consequence of Theorem 6, just as Theorem 5 of the previous section followed
from Theorem 4.

A. Extension to competition with 2k + 1-order sliding window denoisers 

The scheme we propose is pictorially depicted in Fig. 2 below.

Fig. 2. Schematic representation of the $2k+1$-length sliding window denoiser

The necessity for independence of the symbols in
the density estimation procedure discussed in Section III-A, coupled with the memoryless nature of the channel, is
the motivation for partitioning the problem into subsequences that are processed similarly, but separately. A $2k+1$-tuple
super-symbol is formed by jumping a length of $2k+1$ to achieve the independence condition between the



successive super-symbols. Note that there are 2k + 1 such subsequences and each subsequence, i (counting in the 
order of symbols in the sequence), consists of [" ""ffc+i -1 "!' 2fc + 1-tuple super symbols. We label the subsequences 



as x ni , for 1 < i < 2k + 1. For a fixed n, each subsequence x ni has the following super symbols, 



Xr- 



'-]-!) (2k+l)+i+2k 



2k+i Ak+l+i 

1 ' 2fe + 1+i '"" ^(r "- 2 2 fc fc +i"' i-i)(2fc+i)+» 

This facilitates the extension of the ideas from the symbols of the symbol-by-symbol denoiser to the super-symbols
of the $2k+1$-length sliding window denoiser. Some definitions are in order before we set out to investigate the optimality
results of the scheme. As in the symbol-by-symbol scheme, let $\hat{f}^{k}_Y$ denote the $k$-th order density estimate of the
noisy sequence of symbols; it is computed exactly as in (24) except that $y, Y_i \in \mathbb{R}^{2k+1}$. Denote by $\mathcal{F}^{[a,b]^k}$ the set
of all probability distribution functions with support contained in the hypercube $[a,b]^{2k+1}$. Let $D_k(x^n)$ denote the
$k$-th order sliding window minimum loss, defined as

\[
D_k(x^n) = \min_{g} E\left[ \frac{1}{n-2k} \sum_{i=k+1}^{n-k} \Lambda\left(x_i, g(Y_{i-k}^{i+k})\right) \right] \tag{44}
\]

Note the similar definition of symbol-by-symbol denoisability given earlier. As before, $D_k(x^n)$ can be expressed as

\[
D_k(x^n) = \min_{g} E_{F^{k}_{x^n} \otimes \mathcal{C}}\, \Lambda\left(X_0, g(Y_{-k}^{k})\right) \tag{45}
\]

where $F^{k}_{x^n}$ is the $k$-th order empirical distribution of the source. Define further the sliding window denoisability of
the individual sequence $\mathbf{x} = (x_1, x_2, x_3, \cdots)$ by

\[
D(\mathbf{x}) = \lim_{k \to \infty} \limsup_{n \to \infty} D_k(x^n) \tag{46}
\]



where the limit exists by monotonicity. In words, $D(\mathbf{x})$ is the loss of a genie who knows the underlying clean
sequence and can choose to denoise with the best sliding window scheme, of arbitrary order. The definition
of the $k$-th order minimum loss extends to a subsequence, $x^{n_i}$, as



D k (x ni ) 



9 x % 



(47) 



The mapping to the corresponding $k$-th order input empirical distribution is given by

\[
\hat{F}^{k}_{x^n}[Y^n] = \arg\min_{F \in \mathcal{F}^{k}_{n}} d\left( \hat{f}^{k}_{Y},\ \int \prod_{i=-k}^{k} f_{Y|x_i}\, dF \right) \tag{48}
\]

where the second argument of $d$ is the density $[F \otimes \mathcal{C}]^{k}$ induced at the channel output, and $\mathcal{F}^{k}_{n} \subseteq \mathcal{F}^{[a,b]^k}$ denotes the set of $k$-th order ($1 \le k \le \lfloor n/2 \rfloor$) empirical distributions induced by $n$-tuples
with $[a,b]^{2k+1}$-valued components. $\hat{P}^{\delta,\Delta,k}_{x^n}$ denotes the $k$-th order estimate of the input empirical distribution of the
source, defined analogously to the symbol-by-symbol case. The $2k+1$-length sliding window denoiser for each
of the subsequences, $i$, is given by



X 



rii,8.A,kr 



\y n ](j) = 5o P t [PZ£ ' Vl] (y] + S) . 3 e {* + h 3k + 1 + i 



n — 2k — i — 1-, 
' 2fc + l ' 



(49) 
where the $k$-th order equivalent of the denoiser in (22) is given by



ffopt[-P] (y1 k ) = axgmmA(-,x) T [P®C] 



xeA 



\u\ v k ; 



aremin > A (a, £) 



ze.4 



E 



ae.4 



u* t £A 2k+1 :u =a Li=— fe 



n /yi«.=««(w)^^-*=«-fc) 



(50) 



Let $\mathcal{F}^{\delta,\Delta}_{k}$ denote the set of $2k+1$-dimensional vectors with components in $[0,1]$ that are integer multiples of $\delta$.
Note that $\hat{P}^{\delta,\Delta,k}_{x^{n_i}}[z^{n_i}] \in \mathcal{F}^{\delta,\Delta}_{k}$ for all $z^n$. Finally, let $\mathcal{G}^{k}_{\delta,\Delta} = \{g_{\mathrm{opt}}[P]\}_{P \in \mathcal{F}^{\delta,\Delta}_{k}}$ and

\[
\hat{X}^{n,\delta,\Delta,k} = \left\{ \hat{X}^{n_i,\delta,\Delta,k} \right\}_{1 \le i \le 2k+1} \tag{51}
\]

be our candidate for the $n$-block $2k+1$-length sliding window denoiser. It is the sequence of $2k+1$ denoisers
that operate individually on each of the subsequences. The cumulative loss incurred by this sequence of denoisers
is defined as

\[
L_{\hat{X}^{n,\delta,\Delta,k}} = \frac{1}{2k+1} \sum_{i=1}^{2k+1} L_{\hat{X}^{n_i,\delta,\Delta,k}} \tag{52}
\]

where $L_{\hat{X}^{n_i,\delta,\Delta,k}}$ is the cumulative loss incurred by the proposed denoiser for the $i$-th subsequence. The following
Lemma illustrates a rather intuitive fact: the average minimum $k$-th order sliding window loss incurred by operating
on each of the subsequences is at most the minimum $k$-th order sliding window loss for the entire sequence.

Lemma 3: For all $n \ge 1$, $k \le \lfloor n/2 \rfloor$,

\[
\frac{1}{2k+1} \sum_{i=1}^{2k+1} D_k(x^{n_i}) \le D_k(x^n) \tag{53}
\]



B. Performance guarantees 

In this section we present Theorem 7, wherein we demonstrate that, provided certain growth constraints on the
context length $k$, quantization step sizes $\delta$, $\Delta$ and bandwidth $h$ of the kernel density estimate are satisfied, the cumulative
loss, $L_{\hat{X}^{n,\delta,\Delta,k}}$, incurred by the proposed denoiser asymptotically approaches the sliding window denoisability. The
growth constraints are specified at the end of this section. They are dictated by an exponential bound on the deviation
between the cumulative loss, $L_{\hat{X}^{n,\delta,\Delta,k}}$, and $D_k$, which we now develop.
Let 



On (e, ft, S, A, p, 7) 



1 



A 2 



[A (k, e + <5A max , A max ) exp (-(n + V)G (ft, e + <5A max , A max )) + 



A ( k, y/l-p, - J exp ( -(n + 1)G f ft, yjl-p, - 



1 e 1, 1 Pi 2(2fc+i) 



where, 



and 
A(k,e,B) = (2k + l)exp(^\ (54)

G ^v = w^w (55) 



v (e, S, A, A, C, k) = 3e + 5<5A max + 4^ fc+1 A max + 4A(A) (l + ^ fe+1 ) (56) 



We now state the analogue of Theorem 4 in the present setting, which bounds the deviation of the cumulative
loss incurred by the proposed $2k+1$-length sliding window denoiser from the minimum possible $D_k(x^n)$. Note
that here, $x \in [a,b]^{2k+1}$ and $Y \in [a,b]^{2k+1}$ ($2k+1$-tuple super-symbols) is the continuous valued output of the
memoryless channel.

Theorem 6: For all n > 1, e > 0, 5 > 0, p = p(e, (5) defined in (03, A > 0, 1 < k < [f J and x n G [a, b] n 

Pr (Ljfn.i.A.* (x n ,Y n ) - D k {x n ) > v (s, <5, A, A, C, *)) < a n (e, fc, <5, A, p, 7fc ) V n s.t nfc* > n fc (C, p, tf, if) 

(57) 
where, 

-Yi- = (58) 

(II A \\ L +A max || S HI +(6 - a) || A || L || S ||| +A max ) 

||S||| (the fc th order equivalent of ||3||l m d39b ) and n k (C,p,S,K) are defined in dl59t and ( II 10b respectively. 
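Take now, $k = k_n$, $\delta = \delta_n$ and $\Delta = \Delta_n$ such that $k_n \to \infty$, $\delta_n \downarrow 0$, $\Delta_n \downarrow 0$,

\[
\sum_{n=1}^{\infty} \alpha_n(\epsilon, k_n, \delta_n, \Delta_n, \rho, \gamma_{k_n}) < \infty
\]

and $n_k(\mathcal{C}, \rho, \delta, K) < \infty$. With growth rates that satisfy these conditions let,

\[
\hat{X}^{n}_{\mathrm{univ}} = \hat{X}^{n,\delta_n,\Delta_n,k_n} \tag{59}
\]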

For example, it can be verified that unbounded increasing fe„ = log (log(n)), h n — lo 3„\ , <5nfcn — * 0, I <5„ , A„ = lo \ -, 
satisfies the requirements for a family, C, that has <5^"" +1 — > and loss functions that have A (A n ) <5^*" +1 — » 0. 
Particularly for additive Gaussian noise channels of finite variance, squared and absolute loss functions with the 
aforementioned growth rates of k n , A n , S n satisfy the conditions of A (A n ) 5\ ™ +1 — > and 5\ *" +1 —* 0. 
We now have the following result as a direct consequence of Theorem 6 and the Borel-Cantelli Lemma.

Theorem 7: For all $\mathbf{x} \in [a,b]^{\infty}$,

\[
\lim_{n \to \infty} \left[ L_{\hat{X}^{n}_{\mathrm{univ}}}(x^n, Y^n) - D_{k_n}(x^n) \right] = 0 \quad \text{a.s.} \tag{60}
\]

In fact, we can go a step further and show that the lim sup of the cumulative loss incurred by the proposed denoiser
is bounded by the sliding window denoisability. Specifically,

Corollary 1: For all $\mathbf{x} \in [a,b]^{\infty}$,

\[
\limsup_{n \to \infty} \left[ L_{\hat{X}^{n}_{\mathrm{univ}}}(x^n, Y^n) - D(\mathbf{x}) \right] \le 0 \quad \text{a.s.} \tag{61}
\]

which is a corollary of Theorem 7, proved similarly to Corollary 1 in [5].



C. Computation complexity of the proposed denoiser 

Let us summarize the computational complexity of the proposed denoisers: the "symbol-by-symbol" and the $k$-th
order extensions. For the symbol-by-symbol denoiser, we have already covered the analysis in Sections III-D.1 and
III-D.2. For $\hat{X}^{n}_{\mathrm{univ}}$ defined in (59), we have:

a) Symbol-by-symbol scheme:

1) Fast Kernel Density Estimation, $O(n)$

Using the techniques of fast kernel density estimation in [29], [28], [23], [14], it was shown that the complexity
can be reduced from $O(n^2)$ to $O(n)$.

2) Channel Inversion, $O(n^3)$

The polynomial complexity of the simplex approach in linear programming problems is discussed in detail
in [2].

b) $k$-th order sliding window scheme:

1) Fast Kernel Density Estimation, $O(n)$

As before, the complexity of the denoiser continues to be linear in the length of the data, $n$, and the context
length, $k$, i.e., $O(nk)$ [14].

2) Channel Inversion, $O(n^{6k})$

Since the contexts have length $2k$, the complexity of the channel inversion now grows
exponentially in $k$ and is given by $O(n^{6k})$. Thus, our schemes are practical for small values of $k$, but
become unrealistic to implement as $k$ grows.

This led to our follow-up work in [33], which uses quantized contexts in conjunction with the (low complexity)
symbol-by-symbol denoiser and asymptotically (with increasing levels of quantization of the contexts) achieves
the performance of the sequence of denoisers proposed here.

VI. Universality in the stochastic setting 

Our results also imply optimality for the stochastic setting when the source (clean signal) is a stationary stochastic
process with distribution $F_X$. For the pair $(F_X, \mathcal{C})$, define the denoisability, $D(F_X, \mathcal{C})$, as

\[
D(F_X, \mathcal{C}) = \lim_{n \to \infty} \min_{\hat{X}^n} E L_{\hat{X}^n}(X^n, Y^n), \tag{62}
\]

where the expectation assumes that $X^n$ are the first $n$ symbols emitted by a source with distribution $F_X$ and $Y^n$ is,
as before, the $n$-tuple of output noisy symbols from the channel $\mathcal{C}$ that corrupts $X^n$. This performance is achieved by a "genie"
that has access to the true distribution, $F_X$, of the underlying clean signal, $X$. It has been shown in [36], [5] that
the limit in (62) exists and hence the denoisability, $D(F_X, \mathcal{C})$, is well-defined for every stationary $F_X$.

We now state the main result for the stochastic setting, wherein we establish that for any stationary underlying
clean sequence $X \sim F_X$, the expected cumulative loss incurred by our proposed scheme asymptotically achieves
the denoisability, $D(F_X, \mathcal{C})$.

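Theorem 8: For all stationary $X$,

\[
\lim_{n \to \infty} E L_{\hat{X}^{n}_{\mathrm{univ}}}(X^n, Y^n) = D(F_X, \mathcal{C}) \tag{63}
\]

If $X$ is also ergodic then

\[
\limsup_{n \to \infty} L_{\hat{X}^{n}_{\mathrm{univ}}}(X^n, Y^n) = D(F_X, \mathcal{C}) \quad \text{a.s.} \tag{64}
\]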



Given the results established for the semi-stochastic setting, the proof is analogous to that of Theorem 3 in [5],
except for some subtle differences in our setting due to the continuous input and output alphabets. For completeness,
and to accommodate these differences, we provide the proof of the above statement in the appendix.

We conclude this section by comparing the proposed sequence of denoisers to the DUDE-like schemes in [5]
for the case of finite input (or underlying clean data) and continuous valued output (noisy data). By a minor
modification, the proposed denoiser collapses to that in [5] when, as in the setting of [5], the channel input
alphabet is finite. This is illustrated by comparing the first pass of the DUDE-like denoiser with a modified version
of the proposed scheme through the schematic representation in Fig. 3. The theoretical details of the equivalence
of the modification shown in Fig. 3 below to the denoiser in [5] are elaborated in Appendix IX.

VII. Experimental Results 

In this section, we discuss experimental results of applying the proposed scheme to denoising 256-level gray scale
images. We demonstrate the efficacy of the scheme with results of its application to cases of additive and multiplicative
Gaussian noise. In addition, we consider a highly nonlinear, non-conventional noise distribution: a locally varying
Rayleigh noise whose variance is a function of the gray level of the underlying clean image. The first pass of the
denoiser is performed using a fast kernel density estimation approach proposed in [15] and a channel inversion
procedure. This channel inversion is performed using a linear programming technique that maps
the output $k$-th order density estimate to the corresponding $k$-th order input empirical distribution in accordance
with (48). The experimental results presented in this section have been obtained by implementing the scheme
of the previous sections, with no heuristic modifications of the kind that are likely to boost the performance. The practical
implementation aspects are discussed in greater detail and depth in [32], [33].

Fig. 3. Modification to our proposed scheme that is equivalent to that in [5]

The first example we consider is the denoising of the boats image corrupted by an additive white Gaussian noise
(AWGN) channel with $\sigma = 20$. The loss function, $\Lambda$, to be minimized in this case is the squared error between the
true clean image and our denoised estimate. The denoiser in this case is a mapping from $\mathbb{R} \to \mathcal{A} = \{0, \cdots, 255\}$
and reduces to that in (50). Results of the proposed denoising scheme are shown in Fig. 5 below with context
length, $k$, ranging from 1 to 6. The contexts (for $k > 1$) around any location, $i$, in the block of noisy data are 2D
neighborhoods. The 2D contexts for various values of $k$ are shown in Fig. 4 below. As is evident from both the
reported Root Mean Squared Error (RMSE) figures and the perceptual quality, we are able to achieve improved
denoising performance with increasing context lengths. Finally, we compare the results of the proposed scheme
to those achieved by the wavelet-based thresholding scheme of [9] and the Bayesian Least Squares Gaussian Scale Mixture
(BLS-GSM) denoiser of [26]. Increasing the context length, $k$, translates to accruing increasingly higher $k$-th order statistics
from the finite block length data. The classic trade-off between increasing context lengths and the reliability of
the associated higher order statistics is seen in Fig. 6, where we see only marginal gains in the RMSE between
$k = 4$ and $k = 6$. The results for the AWGN case are primarily aimed at demonstrating the practicality of the
proposed scheme, fully acknowledging the performance lead of schemes like the BLS-GSM that are specifically
tailored to the problem of denoising in AWGN channels. The benefits of the proposed approach are in
fact highlighted in unconventional cases like nonlinear noise channels, which will be discussed next.

Fig. 4. 2D Contexts for context length, $k$

Another example of the application of the proposed scheme is in denoising an image corrupted with an unconventional
distribution as discussed earlier in this section. More specifically, we simulate the noisy image by using
a gray-level dependent Rayleigh distribution (with probability density function $f(x) = \frac{x}{B^2} e^{-x^2/(2B^2)}$) whose variance
parameter, $B$, is chosen as a function of the clean image's gray level at that location. In this particular example, we
generate a matrix of 256x256 Rayleigh distributed random variables whose parameters $B$ are chosen according to
the following rule, $B(i,j) = I(i,j) \cdot 35/256$, where $I(i,j)$ is the true value of the clean image at location $(i,j)$. For
brevity, we discuss the denoising performance in this setting only for the symbol-by-symbol case, which suffices
to convey the efficacy of the proposed scheme. More detailed results and discussions on this problem
setting can be found in [32]. We compare, in Fig. 7, the empirical distribution estimate, $\hat{F}_{x^n}$, of the underlying
clean image with the histogram generated from access to the "true" clean image. We also compare these results to
the smoothed histogram estimate of the true clean image that was produced using the kernel density estimation
approach in [15]. From a visual inspection of the figure, it is evident that we are able to reasonably recover the true
marginal empirical distribution of the underlying clean image and correspondingly the estimate of the true image.
Finally, we present the results of denoising the boats image corrupted by a multiplicative Gaussian noise
with distribution $\mathcal{N}(1, 0.2)$ in Fig. 8. The noise in this case literally multiplies the
original clean image to corrupt it and, as such, the effects are relatively more catastrophic. We compare, qualitatively,
the results from the proposed denoiser with those of [26] to validate its efficacy.
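The three noise models used in these experiments can be simulated with a short Python sketch. It is illustrative only and rests on several assumptions: the Rayleigh observation is taken to be the Rayleigh variate itself with scale $B = I \cdot 35/256$, the second parameter of $\mathcal{N}(1, 0.2)$ is read as a variance, and all function names are hypothetical.

```python
import math
import random

def awgn(pixel, rng, sigma=20.0):
    """Additive white Gaussian noise with standard deviation sigma."""
    return pixel + rng.gauss(0.0, sigma)

def rayleigh_observation(pixel, rng, gain=35.0 / 256.0):
    """Rayleigh variate with gray-level dependent scale B = pixel * gain.

    Uses inverse-CDF sampling: F(x) = 1 - exp(-x^2 / (2 B^2)).
    """
    b = pixel * gain
    u = 1.0 - rng.random()            # lies in (0, 1], so log(u) is finite
    return b * math.sqrt(-2.0 * math.log(u))

def multiplicative_gaussian(pixel, rng, mean=1.0, std=math.sqrt(0.2)):
    """Multiplicative noise; 0.2 is assumed to be the variance."""
    return pixel * rng.gauss(mean, std)

def corrupt(image, channel, rng):
    """Apply a memoryless channel independently to every pixel."""
    return [[channel(p, rng) for p in row] for row in image]
```

For instance, `corrupt(image, rayleigh_observation, random.Random(0))` produces a noisy image whose local scale follows the gray level of `image`, so dark regions are barely perturbed while bright regions are heavily corrupted.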



VIII. Conclusion and Future Directions 

We have presented a family of schemes for denoising continuous amplitude signals that is universally optimal. A
salient feature of our setting and results is the wide generality of channels and loss functions to which they apply.
The techniques presented in this paper draw from the "DUDE framework" in [36]. A weighted 'context aggregation'
was suggested in [36] as an approach to enhance the performance of the DUDE in the first pass of the statistics
collection. The proposed technique provides a natural context aggregation mechanism whereby neighboring contexts,
in addition to the observed one, are weighted by the kernel in the density estimation step. The denoiser proposed in
[5] was shown to be asymptotically universal and extended the domain of applicability of DUDE-like schemes to
cases where the noise is continuous valued. This approach, even though theoretically elegant, suffers from some of
the same issues as the DUDE in terms of sparseness of statistics for large alphabet sizes. Our technique addresses
this issue in the setting considered in [5] through the natural context aggregation induced by the kernel density
estimation. In the setting where the underlying clean signal is discrete-valued, taking values in a finite alphabet,
a slight modification of our scheme has been shown to reduce to the scheme in [5]. We also simultaneously
provide a framework to address the case of continuous valued alphabets, where there is a need to learn distribution
functions instead of individual mass points as in the discrete-valued case. Finally, the proposed scheme is practical
and tractable in its computational requirements, as demonstrated by the experimental results.

The experimental results in this paper seem promising enough to motivate further exploration of practical aspects 
of the proposed scheme. This is an interesting future direction that is currently under investigation. Additional 
directions of research include studying the applicability of recursive density estimation techniques discussed in [18] 
in designing recursive denoisers as an alternative to the scheme presented in this paper. This would be particularly 
useful in multidimensional data applications like denoising noise corrupted video. It could also be of theoretical 
interest to understand the implications of a recursive structure to the denoiser and its associated optimality results. 

Appendix I 
Conditions on the channel 

In addition to conditions C1-C4 in Section II, the following conditions on the channel (noise distribution) round
out the necessary assumptions for the performance guarantees made in this work.
C5. The channel satisfies the uniform Lipschitz continuity condition,



sup \\fy\x(y) \\bl < oo (65) 



where 



ll/y|x(y)l|sL = \\fY\x(y)h + \\fY\x(y)\\oo (66) 

\\f Y \ x (y)h = sup iM^tJhM <00; y yeR (67) 

|JA>(y)||oo = sup f Y \ x (y) (68) 

x£[a,b] 

C6. The conditional densities, additionally, satisfy the following Lipschitz continuity condition,

\[
\|\Xi\|_L = \sup_{0 < \Delta < (b-a)} \frac{\xi_\Delta}{\Delta} < \infty \tag{69}
\]

where $\xi_\Delta$ is defined in (38).
C7a. The family of conditional densities, $\mathcal{C}$, has uniformly bounded second order universal derivatives, i.e., there exists a $B_{\mathcal{C}}$
s.t. $0 < B_{\mathcal{C}} < \infty$ and $D_2^*(f_{Y|x}) \le B_{\mathcal{C}}$, $\forall x \in [a, b]$, where the second order universal derivative is defined as
(refer to [6] for further details)

\[
D_2^*(f_{Y|x}) = \liminf_{h \downarrow 0} \int \left| (f_{Y|x} * \phi_h)''(y) \right| dy \tag{70}
\]

with $\phi_h(x) = \frac{1}{h}\phi\left(\frac{x}{h}\right)$, $\phi \in C^{\infty}$. Here $C^{\infty}$ is the set of functions that have infinitely many continuous derivatives with
compact support, and $f^{(s)}$ denotes the $s$-th derivative of $f$. This is a mild technical condition that enables
the proof of the convergence of marginal density estimates at the output of the memoryless channel to the
true marginal density. Note that we are not imposing differentiability of the conditional densities of the
channel themselves. We are, instead, proposing a milder constraint that the smoothed version of the channel
conditional densities is "differentiable enough". This condition is trivially satisfied if we have a family of
conditional densities that have a uniformly absolutely continuous derivative.
C7b. An alternative to the previous condition on the family of conditional densities of the channel is $\lim_{t \to 0} \Omega_{\mathcal{C}}(t) = 0$,
where

\[
\Omega_{\mathcal{C}}(t) = \sup_{x \in [a,b]} \omega_x(t) \tag{71}
\]

and

\[
\omega_x(t) = \int \left| f_{Y|x}(y - t) - f_{Y|x}(y) \right| dy \tag{72}
\]

From the fact [37] that, for any $f \in L_1(\mathbb{R})$, the corresponding $L_1$-modulus of continuity satisfies

\[
\omega(t) = \int |f(x - t) - f(x)|\, dx \to 0, \quad \text{as } |t| \to 0
\]

and

\[
\|\omega\|_{\infty} \le 2\|f\|_1 < \infty
\]

it follows that the global $L_1$-modulus of continuity, $\Omega_{\mathcal{C}}(t)$, is well-defined for all $t$ and all families of conditional
densities, $\mathcal{C}$. In other words, this condition demands uniform convergence of the $L_1$-moduli of continuity of
the individual members comprising the family of conditional densities.
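As a numerical illustration of condition C7b, the sketch below approximates the $L_1$-modulus of continuity in (72) by a midpoint rule and takes the supremum in (71) over a finite grid of channel inputs. The additive Gaussian channel, the integration limits and all function names are assumptions made for this example. For a Gaussian density with standard deviation $\sigma$, the modulus has the closed form $2\,\mathrm{erf}\!\left(|t| / (2\sqrt{2}\sigma)\right)$, which makes the sketch easy to check.

```python
import math

def gaussian_pdf(y, mean=0.0, sigma=1.0):
    """Density of N(mean, sigma^2); plays the role of f_{Y|x} with x = mean."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def l1_modulus(pdf, t, lo=-12.0, hi=12.0, steps=4800):
    """Midpoint-rule approximation of  omega(t) = int |f(y - t) - f(y)| dy."""
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * dy
        total += abs(pdf(y - t) - pdf(y))
    return total * dy

def global_modulus(means, t, sigma=1.0):
    """Omega_C(t): largest modulus over a finite grid of channel inputs."""
    return max(l1_modulus(lambda y, m=m: gaussian_pdf(y, m, sigma), t)
               for m in means)
```

For an additive channel the modulus does not depend on the input, so `global_modulus` returns the same value for every grid of means; it tends to 0 as `t` tends to 0 and never exceeds 2, in line with the bound $\|\omega\|_\infty \le 2\|f\|_1$.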

Appendix II 
Proof of LemmaQ] 

A theorem necessary for the proof of Lemma Q] is as follows 

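Theorem 9: Every kernel $K$ with $\int K = 1$, $K \ge 0$ is an approximate identity, i.e., for $\lim_{n \to \infty} h_n = 0$ and every
$f_i \in L_1$ s.t. $D_s^*(f_i) < \infty$ are uniformly bounded, we have

\[
\lim_{n \to \infty} \int \left| \left( \frac{1}{n} \sum_{i=1}^{n} f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^{n} f_i \right| = 0
\]

An alternate formulation of the approximate identity is the following,

Theorem 10: Every kernel $K$ with $\int K = 1$, $K \ge 0$ is an approximate identity, i.e., for $\lim_{n \to \infty} h_n = 0$ and
every $f_i \in L_1$ s.t. $\lim_{|t| \to 0} \Omega_{\mathcal{C}}(t) = 0$,

\[
\lim_{n \to \infty} \int \left| \left( \frac{1}{n} \sum_{i=1}^{n} f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^{n} f_i \right| = 0
\]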



A definition regarding the notion of an associated kernel, L, with the kernel, K that is necessary for the subsequent 
proof is, 

Definition 5: The function L defined by 

L(x) = {-If J^ {y ~^ K{y)dy (x > 0) 
L{-x) = {-l) s L{x) < 0) 

is the kernel associated with kernel K. The function L is sometimes said to have a parameter s since it figures in 

the definition of L. When K is symmetric, L is symmetric. 

Furthermore,

$$\int |L| \le \frac{1}{s!} \int |x|^s |K(x)|\,dx \tag{73}$$

for all nonnegative integers $s$. For $s = 0$, we define $L = K$. For $K \ge 0$, we have the equality

$$\int |L| = \frac{1}{s!} \int |x|^s |K(x)|\,dx \tag{74}$$

Finally, 



/*L = f ^K(x)dx 



(75) 



: s odd 

: s even, and the order of K is > s 
Proof: [Proof of Theorem 9]
Let us start with the case that $f_i$ has $s - 1$ absolutely continuous derivatives. Then, by Taylor's series expansion with remainder,

$$f_i(x + y) - f_i(x) = \sum_{j=1}^{s-1} \frac{y^j}{j!} f_i^{(j)}(x) + \int_x^{x+y} \frac{(x + y - u)^{s-1}}{(s-1)!} f_i^{(s)}(u)\,du$$

so that, for class $s$ kernels $K$,



\ J \Y,h( x + y) - J2fi( x )) K hMdy (recall that J K = 1) 

1 " 

n ^— ' 

»=l 

1 " 

n -^ — ' 






(x + j/-M) S * ( 8 ) 



( s -l)! 



fr'(u)duK hn (y)dy 



/f s W ( ^-TV a^ 



/< w («) 



(x + j/ — w) s_1 



-Kfc* (y)dy du 



1 ™ 

= -n 

71. £~t 



i = l 



1 - 
n £ — ' 



i=l 



ti s \u)(-iy(L) hn (u-x)d<u 



ri 8) (u)(-i)(-iy(-iy(L) hn ( x -u)du 



fl 3) (u)(L) hn (x-u)du 



i n 



(76) 



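where $(L)_{h_n}$ is the kernel associated with $K_{h_n}$ and $L$ is the kernel associated with $K$. Therefore, by Young's inequality [30],

$$\int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| \le \frac{1}{n} \sum_{i=1}^n h_n^s \int \left| f_i^{(s)} \right| \int |L| \tag{77}$$

Since the $f_i$'s have $(s-1)$ absolutely continuous derivatives, $\int |f_i^{(s)}| < \infty$, and further if $\int |f_i^{(s)}| \le M < \infty$, $\forall i$ (uniformly bounded), the inequality in (77) simplifies to

$$\int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| \le h_n^s M \int |L| \tag{78}$$

Since

$$\int |L| \le \frac{1}{s!} \int |x|^s |K(x)|\,dx = B_K < \infty \tag{79}$$

for $K$ being an order $s$ kernel, the inequality in (78) becomes

$$\int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| \le h_n^s M B_K \tag{80}$$

Taking the limit $n \to \infty$ on both sides, we get

$$\lim_{n \to \infty} \int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| \le \lim_{n \to \infty} h_n^s M B_K = 0 \tag{81}$$

This can be extended to general $f_i$'s using the universal derivative defined earlier. As a reminder,

$$D_s^*(f_i) \triangleq \liminf_{h \downarrow 0} \int \left| (f_i * \phi_h)^{(s)} \right| \tag{82}$$

where $\phi$ is a mollifier.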

Mollifiers are class 0 kernels, nonnegative and zero outside $[-1, 1]$. They also have infinitely many continuous derivatives, and they are called mollifiers because of their exceptional smoothing properties. An example of a mollifier is

$$K(x) = C e^{-\frac{1}{1 - x^2}}, \quad |x| < 1 \tag{83}$$
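The normalizing constant $C$ of such a mollifier has no elementary closed form, so a short numerical sketch may help. It assumes the standard bump function $e^{-1/(1-x^2)}$ on $(-1, 1)$, extended by zero elsewhere. The function names and the grid resolution are hypothetical choices for illustration.

```python
import numpy as np

def mollifier_unnormalized(x):
    """exp(-1 / (1 - x^2)) for |x| < 1 and zero elsewhere."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out

# Riemann-sum estimate of the mass of the unnormalized bump on [-1, 1].
grid = np.linspace(-1.0, 1.0, 400001)
dx = grid[1] - grid[0]
mass = float(np.sum(mollifier_unnormalized(grid)) * dx)
C = 1.0 / mass  # constant that makes the mollifier integrate to one

def mollifier(x, h=1.0):
    """Scaled mollifier phi_h(x) = (1 / h) * phi(x / h)."""
    return C * mollifier_unnormalized(np.asarray(x, dtype=float) / h) / h

print(f"mass of unnormalized bump: {mass:.6f}, C = {C:.4f}")
```

The scaled version `mollifier(x, h)` keeps unit mass for every $h > 0$ while its support shrinks to $[-h, h]$, which is the property used when defining the universal derivative.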



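For a class $s$ kernel, $K$, and a family of density functions $\{f_i\}_{i \in \mathbb{N}}$ with associated universal derivatives that are uniformly bounded, i.e., $D_s^*(f_i) \le B_{\mathcal{C}} < \infty$, $\forall i \in \mathbb{N}$, it can then be shown that

$$\int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| \le \frac{1}{n} \sum_{i=1}^n h_n^s D_s^*(f_i) \int |L| \le \frac{1}{n} \sum_{i=1}^n h_n^s B_{\mathcal{C}} \int |L| = h_n^s B_{\mathcal{C}} \int |L| \tag{84}$$

Taking limits on both sides, we get

$$\lim_{n \to \infty} \int \left| \left( \frac{1}{n} \sum_{i=1}^n f_i \right) * K_{h_n} - \frac{1}{n} \sum_{i=1}^n f_i \right| = 0 \tag{85}$$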



Proof: [Proof of Theorem [TU) 

fi(x) = fi(x) J K,M)dl : ■ / /;(.,") /v',, (/)<//. vV 
Therefore, 



(86) 



-£/ ( *jirJ(ao--X;/i(ao = / -£/*(* -*)--£/*(*) ** 

j=l / i=l J L i=l i=l 

n r-f n f-f 



(t)d< 
|jr h (t)|t|ff h (i)|7"di 



(87)
where $\frac{1}{p} + \frac{1}{p'} = 1$ ($\frac{1}{p'} = 0$ if $p = 1$). Applying Hölder's inequality with exponents $p$ and $p'$, and then raising both sides to the $p$-th power and integrating with respect to $x$, we obtain






< 



£/<(*-*) ---E/'( a 



|#j.(t)|<ft 



= 11^1 



1 n 1 n 



(x) 

\K h {t)\dt 
\K h (t)\dt 



< 



K \\f J ~Y,j\f& - *) - f^\ p i^wi dt 



dx 



dx 



dx 



dx 



(88) 



Changing the order of integration in the last expression (which is justified since the integrand is nonnegative), we 
obtain 

/ n \ n . n 

ii -E/< ** fc_ nE/*ii5 ^ w/iiffcWi-E^*)* 

\ i=l / i=l ^ 8=1 



< ||^||f / |iif fc (t)|n(t)dt 



(89) 



For $\delta > 0$,

$$I_h = \int |K_h(t)|\,\Omega(t)\,dt = \int_{|t| < \delta} + \int_{|t| \ge \delta} = A_{h,\delta} + B_{h,\delta} \tag{90}$$

Since $\Omega(t) \to 0$ as $|t| \to 0$, for $\eta > 0$ we can choose $\delta$ so small that $\Omega(t) < \eta$ if $|t| < \delta$. Then

$$A_{h,\delta} \le \eta \int_{|t| < \delta} |K_h(t)|\,dt \le \eta \|K\|_1, \quad \forall h > 0 \tag{91}$$

Also, $\Omega$ is a bounded function by Minkowski's inequality [note that $\|\Omega\|_\infty \le \sup_{i \in \mathbb{N}} \|\omega_i\|_\infty \le \sup_{i \in \mathbb{N}} (2\|f_i\|_p)^p$, which for $p = 1$ becomes $\|\Omega\|_\infty \le 2$], so that $B_{h,\delta}$ is less than a constant multiple of $\int_{|t| \ge \delta} |K_h(t)|\,dt$, which tends to zero with $h$. This proves that $I_h \to 0$ as $h \to 0$ and the theorem follows. ■

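Another lemma necessary for the proof of Lemma 1 is the following.

Lemma 4: (A multinomial distribution inequality)

Let $N_1, \ldots, N_k$ be a multinomial random vector with parameters $n, p_1, \ldots, p_k$. Then

$$P\left( \sum_{i=1}^k \left| \frac{N_i}{n} - p_i \right| > \epsilon \right) \le 2^{k+1} e^{-n\epsilon^2/2} \tag{92}$$

Proof:

By Scheffé's theorem,

$$\sum_{i=1}^k \left| \frac{N_i}{n} - p_i \right| = 2 \sup_{A \in \mathcal{A}} \left| \frac{N(A)}{n} - P(A) \right| \tag{93}$$

where $\mathcal{A}$ = {all $2^k$ possible sets of integers from $1, \ldots, k$}, $N(A) = \sum_{i \in A} N_i$ is the number of observations falling in $A$, and $P(A) = \sum_{i \in A} p_i$. By Bonferroni's inequality and Hoeffding's inequality,

$$P\left( \sup_{A \in \mathcal{A}} \left| \frac{N(A)}{n} - P(A) \right| > \frac{\epsilon}{2} \right) \le 2^k \cdot 2 e^{-n\epsilon^2/2} \tag{94}$$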
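The multinomial inequality can be checked empirically. The sketch below is only an illustration under assumed parameters ($k = 3$ cells, $n = 200$, $\epsilon = 0.3$, and a fixed random seed). It estimates the probability that the $L_1$ deviation of the empirical frequencies exceeds $\epsilon$ and compares it with the bound $2^{k+1} e^{-n\epsilon^2/2}$. The function names are hypothetical.

```python
import numpy as np

def multinomial_l1_tail(n, p, eps, trials=20000, seed=0):
    """Monte Carlo estimate of P(sum_i |N_i / n - p_i| > eps)."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, p, size=trials)          # shape (trials, k)
    deviation = np.abs(counts / n - np.asarray(p)).sum(axis=1)
    return float(np.mean(deviation > eps))

def multinomial_bound(n, k, eps):
    """Right-hand side 2^(k+1) * exp(-n * eps^2 / 2) of the inequality."""
    return float(2.0 ** (k + 1) * np.exp(-n * eps ** 2 / 2.0))

p = [0.2, 0.3, 0.5]
n, eps = 200, 0.3
empirical = multinomial_l1_tail(n, p, eps)
bound = multinomial_bound(n, len(p), eps)
print(f"empirical tail = {empirical:.5f}, bound = {bound:.5f}")
```

The bound is loose for small $k$, which is typical of union-bound arguments; its value in the proof lies in the exponential decay in $n$.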
The expected value of $f_n(x)$ is denoted by

$$g_h(x) = E(f_n(x)) = \frac{1}{n h^d} \sum_{i=1}^n \int K\left( \frac{x - y}{h} \right) f_i(y)\,dy \tag{95}$$


Proof: [Proof of Lemma 1]
Let $g_h$ be defined as in (95). By Theorem 1, it is enough to show that $\int |f_n(x) - g_h(x)|\,dx \to 0$ exponentially. Let $\mu_n$ be the empirical probability measure for $X_1, X_2, \ldots, X_n$ and note that



rw = Ts[ K ( E ir)"" l ' l!n 



(96) 
(97) 



For given $\epsilon > 0$, find finite constants $M, L, N, a_1, \ldots, a_N$ and disjoint finite rectangles $A_1, \ldots, A_N$ in $\mathbb{R}^d$ such that the function

$$K^*(x) = \sum_{i=1}^N a_i 1_{A_i}(x) \tag{98}$$

satisfies: $|K^*| \le M$, $K^* = 0$ outside $[-L, L]^d$, and $\int |K(x) - K^*(x)|\,dx < \epsilon$. Define $g_h^*$ and $f_n^*$ as $g_h$ and $f_n$ with $K^*$ instead of $K$. Then

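$$\begin{aligned}
\int |f_n(x) - g_h(x)|\,dx &\le \int |f_n(x) - f_n^*(x)|\,dx + \int |f_n^*(x) - g_h^*(x)|\,dx + \int |g_h^*(x) - g_h(x)|\,dx \\
&\le \int\!\!\int \frac{1}{h^d} \left| K^*\!\left( \frac{x - y}{h} \right) - K\!\left( \frac{x - y}{h} \right) \right| \mu_n(dy)\,dx \\
&\quad + \int \frac{1}{n h^d} \sum_{i=1}^n \int \left| K^*\!\left( \frac{x - y}{h} \right) - K\!\left( \frac{x - y}{h} \right) \right| f_i(y)\,dy\,dx + \int |f_n^*(x) - g_h^*(x)|\,dx \\
&\le 2\epsilon + \int |f_n^*(x) - g_h^*(x)|\,dx
\end{aligned}$$

by a double change of integral. But, if $\mu_i$ is the probability measure for $f_i$,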

/ \ r{ x) - gUx )\ d x < £ k | / -L £ J fMdy _ 2. J 

I — 1 J — 1 z 



Vn(dy) 



-hAi 



N 



i—1 



1 " 

- } fjiAx - hAi) - Mn(x - hAi) 



j=l 



d.r 



dx 



(99) 



Lemma Q] follows if we can show that for all finite rectangles A of '. 



1 N 

7d Z^ 



h d 



i=l 



1 " 
n *. — ^ 



i=i 



(is — > exponentially as n — > 00 



Choose an $A$, and let $\epsilon > 0$ be arbitrary. Consider the partition of $\mathbb{R}^d$ into sets $B$ that are $d$-fold products of intervals of the form $\left[ \frac{(i-1)h}{N}, \frac{ih}{N} \right)$, where $i$ is an integer, and $N$ is a new constant to be chosen later. Call the partition $\Pi$. Let



and 



n 



i iv 



A*=n 



1=1 



Xi + N , Xi + a, N 



Define 



Clearly, for any n 



C x =\x-hA- \J B \ Cx + h(A-A*)=C* 



Ben 
BCx-h 



1 - 

— y fj,j (x — hA) — /i„ (x — hA) 



i=i 



f 1 " 

dx ^ / E \-Y.^ B )-^ B )\ dx 

J Ben j = \ 

BC-x-hk 

+ / u|> + A ^ 



The last term in dlOOt equals 

2\{h{A-A*)) = 2h d X{A-A*) 

I d d 



N 



where A is the Lebesgue measure. Now, putting ( 1 1021 ), ( 11001 ) and J99l ) together, we get 



\P(x) - g h (x)\dx < 2e + !/"*(*) - g* h (x)\ 



N n N 

A 1 <J n^r, „' 1 A 1 '' 



1 N 

< 2e +^ENE 



i=i Ben 

JV 



Ben 7—1 

BCx-hAi 



1 " 



r N 2 

/ dx + V|a,|— h d \{Ai-Ai*) 

JBCx-hAi „-_-, " 



8=1 

JV 



1 1 2 



8=1 sen 3=1 



CAT \ 1 n JV 

£ kiA^) E i- E ^( B ) - ^( B )i + 2 E n a ( a * ~ A *)) 
j=i / Ben n j=i i=i 



(100) 

(101) 
(102) 



(103) 



The third term on the right hand side can be made smaller than $\epsilon$ by choosing $N$ large enough ($A_i^* \to A_i$, $\forall i$, as $N \to \infty$). The coefficient of the first term on the right hand side is equal to $\int |K^*| \le 1 + \epsilon$. Thus, we have shown that for every $\epsilon > 0$, we can find $N$ large enough such that

f 1 n 

/ \f n {x) - g h {x)\dx < 3e+(l + e)^|-^ Mj (S)-Mn(B) 

J sen n j=i 

1 ™ 

^ 5e +E bE^) -m b )i 



n 
sen j=i 



(104) 



We are almost in a position to use the multinomial inequality, were it not for the fact that the partition $\Pi$ is infinite. Thus, it is necessary to "cut off" the tails of the distribution. Consider a finite partition, $\Pi_r$, consisting of the sets of $\Pi$ that have a non-empty intersection with $[-r, r]^d$, where $r > 0$ is to be picked later. Let $\Pi_r^*$ be $\Pi_r \cup [-r, r]^d$. The cardinality of $\Pi_r$ is at most

$$\left( \frac{2rN}{h} + 2 \right)^d = o(n)$$

To take care of the tails we argue as follows: let $T$ stand for the tail set, i.e., the complement of $[-r, r]^d$. Then



E 

sen 



1 ™ 

-£>i(£)-Mn(B) 



J=l 



< 



E 

sen,. 



1 " 



.7 = 1 



<E 

sen,. 



1 " 

-^ W (B)-^(B) 



j'=i 



* E 

Ben r . 

< E 

sen r . 



1 

2 -E^'( r ) 

i=i 
1 " 
i=i 
1 " 



1 ™ 

-]T Mj (T) + M „(T) 

n i=i 
1 " 

-E^'( T ) -m t ) 
i n 

2 :Ew( T ) 



i=i 



j'=i 



2sup/Xi(T) 

iei 



(105) 



Now, $2 \sup_{i \in I} \mu_i(T)$ can be made smaller than $\epsilon$ by choice of $r$. This gives,



f\f n (x)-g h (x)\dx < 6e+Y, 



1 " 
-$>i( B )-M£) 



i=i 



(106) 



where r depends on e, T, and TV depends on e, if. 
By Lemma 1, for S > 6e and p G (0, 1), 



p I \r-9H\>s) < p E 



, B7I> 



1 " 



j=i 



> <5 - 6e 






(107) 
(108) 



This concludes the proof of $5 \Rightarrow 4$ for nonnegative $K$. Note that the inequality can be forced for all $n, h$ with

16 + 4 d+1 



n > 



P 5 2 
po z 



(109) 
(HO) 
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if we pick 



For the symbol-by-symbol case, d — 1 and ( II 101 ) becomes 



■-SO-iH 



n > 



16 + 4 d+1 
~~pT 2 

ih d > n (C, p, S, K) = ^ 2 



(111) 
(112) 



Appendix III 
Proof of Theorem[2] 

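Definition 6 (Prohorov metric): For any two laws $P$ and $Q$ on the set $[a, b] \subset \mathbb{R}$, the Prohorov metric, $\rho$, is defined as

$$\rho(P, Q) := \inf\{\epsilon > 0 : Q(B) \le P(B^\epsilon) + \epsilon,\ \forall B \in \mathcal{B}_{[a,b]}\}$$

where $B^\epsilon = \{\tilde{x} : |x - \tilde{x}| < \epsilon,\ x \in B\}$.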



Proof: [Proof of Theorem H) Let P n and Q„ denote the laws associated with the distribution functions, F x n 
and F x n, From [11, Theorem 11.7.1], p(P n ,Q n ) —> => (3(P n ,Qn) then by definition of the /3-metric, we have 



lim 

n— *oo 



fd(P n -Q n ) 



V||/[[bl<1 



(113) 



,[a,b] 



By a mere scaling, the above statement is also true for a uniformly bounded Lipschitz class of functions, S AI : = 
{/ : H/Hbl < M, f : [a, b] -> M} for some M < oo. It is also true that 

Vy and /e5 [a ' 6]xK 



lim 

n — »oo 



f(x,y)d(P n -Q n ) 

where S [ £ b]xR := {/ : [a, b] x R -► R, || f(y) \\ BL < M Vy} for some M < oo and 

\f(x,y)-f(z,y)\ 



f(y) IU:=sup- 



\x-z\ 
= sup f(y,x) 



II M II. 

IE 

II /(y) UblHI f(y) h + \\ f(y) IU 

Hence, for a channel with conditional densities, {fY\x}xe[a,b] S SjJ-' , we have 

Vj/eM 



dy^O 



f Y \ x dF x n — I f Y \ x dF x ™ 
and by dominated convergence theorem, 



and hence 



, d ([F x „ 



>C]y, 



F x n ®C 



fy\xdF x n — J jy\ x dF x r 

->0. 



(114) 

(115) 
(116) 
(117) 

(118) 

(119) 
Hence, the mapping of input empirical distributions to output densities induced by the channel,

fm (y) = [F x . ®C] Y = J f Y{x dF xn (x) (120) 

is continuous with respect to the $\beta$ metric on the input distributions and the total variation metric on the output densities. We also have the fact that $(\mathcal{F}^{[a,b]}, \beta)$ is a compact metric space [11, Theorem 11.5.4, Corollary 11.5.5]. Since we have a continuous one-to-one (bijective) mapping between the compact metric space of input distributions with the $\beta$ metric, $(\mathcal{F}^{[a,b]}, \beta)$, and the space of output densities with the total variation metric, $([\mathcal{F}^{[a,b]} \otimes \mathcal{C}]_Y, d)$, we can apply the continuous mapping theorem [30] to get continuity of the inverse mapping too. This gives the desired



result that as d([F x n ® C] Y , 
fact [11], \<p, \{F x n,F x S) 



F x n ®C 

0. 



0, we have f3(P n ,Q n ) — * and p(P n ,Q n ) —> 0. Finally using the 

■ 

Appendix IV 
Proof of Lemma|2] 



Proof: 
Consider / e Cb([a,b}), where Cf, denotes the set of all continuous bounded functions, / : [a, b] 
F e J 7 ^^ and P A that is constructed using (J3TJ 

J fdF(x) - J fP A (dx) 

f (dF(x) ~ P A (dx)) 

N 

I fdF(x)-J2f(cn)P (<* 

i—1 
W-l r a i+1 

E / (/(«*) + w /( A )) dF ^ - E f^ p (°*) 

t=0 Ja * 1=1 

N-l N 

E (/(«,) + W/(A)) P (a t ) - E f(fli)P (Oi) 



For any 



(=0 



iV 



W/ (A)E^K) 

w/(A) 



(121) 



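where $\omega_f(\Delta) = \max_{y \in [a,b]} |f(y + \Delta) - f(y)|$ and $N$ is the number of quantization levels as defined previously. Hence,

$$\lim_{\Delta \to 0} \left| P^\Delta f - P f \right| = \lim_{\Delta \to 0} \left| \int f \left( dF(x) - P^\Delta(dx) \right) \right| \tag{122}$$

$$\le \lim_{\Delta \to 0} \omega_f(\Delta) \tag{123}$$

$$= 0, \quad \forall f \in C_b(\mathbb{R}) \tag{124}$$

This implies the weak convergence $P^\Delta \Rightarrow P$. Hence, the statement of the theorem follows from the fact that the Prohorov metric metrizes weak convergence. ■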
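The quantization argument can be illustrated numerically. The sketch below is written under assumptions that are ours and not the paper's exact construction: the mass in each cell of width $\Delta$ is moved to the left endpoint of the cell, and the modulus of continuity is taken as the supremum over all shifts up to $\Delta$ (the usual definition, which dominates the single-shift version). It then checks that the change in $\int f\,dF$ caused by quantization is at most $\omega_f(\Delta)$. The test function and sample distribution are arbitrary illustrative choices.

```python
import numpy as np

def quantize_down(x, a, delta):
    """Map each point of [a, b) to the left endpoint of its cell of width delta."""
    return a + np.floor((x - a) / delta) * delta

def modulus_of_continuity(f, a, b, delta, num_points=2001, num_shifts=51):
    """Grid approximation of sup over 0 <= s <= delta of max_y |f(y + s) - f(y)| on [a, b]."""
    best = 0.0
    for s in np.linspace(0.0, delta, num_shifts):
        y = np.linspace(a, b - s, num_points)
        best = max(best, float(np.max(np.abs(f(y + s) - f(y)))))
    return best

def f(x):
    """An arbitrary continuous bounded test function."""
    return np.cos(3.0 * x)

a, b = 0.0, 1.0
rng = np.random.default_rng(2)
samples = rng.uniform(a, b, size=5000)       # empirical distribution F on [a, b)

results = {}
for delta in (0.1, 0.01):
    quantized = quantize_down(samples, a, delta)
    gap = abs(float(np.mean(f(samples)) - np.mean(f(quantized))))
    omega = modulus_of_continuity(f, a, b, delta)
    results[delta] = (gap, omega)
    print(f"delta = {delta}: |F f - P^delta f| = {gap:.5f} <= omega_f = {omega:.5f}")
```

Both the gap and the modulus shrink with $\Delta$, which is the mechanism behind the weak convergence of the quantized law to the original one.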






Appendix V
Proof of Theorem 4

Using the definition of the Lipschitz norm of the loss function, $\Lambda$, and the channel continuity function, $\xi_\Delta$, we bound the deviation of the expected value of the loss function under two marginal densities induced at the output of the memoryless channel by the corresponding empirical distributions of the underlying clean signal at the input of the memoryless channel.

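Lemma 5: For any $F, \tilde{F} \in \mathcal{F}^{[a,b]}$, measurable $g : \mathbb{R} \to [a, b]$ and a bounded Lipschitz loss function with $E_{f_{Y|u}} \Lambda(u, g(Y)) < \infty$, $\forall u$,

$$\left| E_{F \otimes \mathcal{C}} \Lambda(U_0, g(Y)) - E_{\tilde{F} \otimes \mathcal{C}} \Lambda(U_0, g(Y)) \right| \le \left( \|\Lambda\|_L + \Lambda_{\max} \|\Xi\|_L + (b - a) \|\Lambda\|_L \|\Xi\|_L + \Lambda_{\max} \right) \beta(P, \tilde{P}) \tag{125}$$

where $P$ and $\tilde{P}$ are the laws associated with $F$ and $\tilde{F}$, and $\beta(P, \tilde{P})$ is the $\beta$ metric between the corresponding laws.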



Similarly, we bound the deviation of the expected loss under the marginal density induced by any empirical distribution at the input of the memoryless channel from the expected loss under the marginal density induced by the corresponding probability mass function (under the mapping discussed in Section III-C) in the following lemma.

Lemma 6: For any $\Delta > 0$, $F \in \mathcal{F}^{[a,b]}$ with the associated law $P$, $P^\Delta \in \mathcal{F}^\Delta$, measurable $g : \mathbb{R} \to [a, b]$ and a continuous bounded loss function with $E_{f_{Y|u}} \Lambda(u, g(Y)) < \infty$, $\forall u$,

$$\left| E_{P^\Delta \otimes \mathcal{C}} \Lambda(U_0, g(Y)) - E_{F \otimes \mathcal{C}} \Lambda(U_0, g(Y)) \right| \le \xi_\Delta \Lambda_{\max} + \lambda(\Delta) (1 + \xi_\Delta)$$

where $\lambda(\Delta)$ is the global modulus of continuity of the loss function $\Lambda$, as defined earlier, and $\xi_\Delta$ is as defined in (38).

The proofs of Lemmas 5 and 6 are given in the following section, Appendix VI.
Lemma 7: For every $n \ge 1$, $x^n \in [a, b]^n$, measurable $g : \mathbb{R} \to [a, b]$, and $\epsilon > 0$,

$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^n \Lambda(x_i, g(Y_i)) - E_{F_{x^n} \otimes \mathcal{C}} \Lambda(U_0, g(Y)) \right| > \epsilon \right) \le 2 \exp\left( -G(\epsilon, \Lambda_{\max})\, n \right) \tag{126}$$

Proof: By linearity of expectation, $\frac{1}{n} \sum_{i=1}^n E\Lambda(x_i, g(Y_i)) = E_{F_{x^n} \otimes \mathcal{C}} \Lambda(U_0, g(Y))$. Thus, the expression inside the absolute value in (126) is a sum of zero-mean random variables, bounded in magnitude by $\Lambda_{\max}$. Furthermore, $\Lambda(x_i, g(Y_i))$ and $\Lambda(x_j, g(Y_j))$ are independent whenever $i \ne j$. This allows the use of Hoeffding's inequality [8] as in [5], leading to (126). ■
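The concentration in Lemma 7 can be observed in simulation. The following sketch rests on several assumptions made only for illustration: an additive Gaussian channel, squared-error loss on $[0, 1]$ (so $\Lambda_{\max} = 1$), the clipping denoiser $g(y) = \min(\max(y, 0), 1)$, and the generic Hoeffding exponent $2\epsilon^2/\Lambda_{\max}^2$ in place of $G(\epsilon, \Lambda_{\max})$, whose exact form is given in the main text and not reproduced here.

```python
import math
import numpy as np

SIGMA = 0.3        # assumed Gaussian channel noise level
LAMBDA_MAX = 1.0   # squared error between points of [0, 1] never exceeds 1

def denoiser(y):
    """Symbol-by-symbol denoiser g: clip the noisy value back into [0, 1]."""
    return np.clip(y, 0.0, 1.0)

def expected_loss(x_value, sigma=SIGMA, num=20001):
    """E[(x - g(Y))^2] for Y = x + N(0, sigma^2), by a Riemann sum."""
    y = np.linspace(x_value - 10.0 * sigma, x_value + 10.0 * sigma, num)
    dy = y[1] - y[0]
    pdf = np.exp(-0.5 * ((y - x_value) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return float(np.sum((x_value - denoiser(y)) ** 2 * pdf) * dy)

def deviation_frequency(x, eps, trials=2000, seed=0):
    """Fraction of trials in which |average loss - expected loss| exceeds eps."""
    rng = np.random.default_rng(seed)
    values, counts = np.unique(x, return_counts=True)
    target = sum(c * expected_loss(v) for v, c in zip(values, counts)) / x.size
    y = x[None, :] + SIGMA * rng.normal(size=(trials, x.size))
    average_loss = ((x[None, :] - denoiser(y)) ** 2).mean(axis=1)
    return float(np.mean(np.abs(average_loss - target) > eps))

x = np.tile([0.2, 0.5, 0.8], 134)[:400]   # an arbitrary individual sequence, n = 400
eps = 0.05
frequency = deviation_frequency(x, eps)
hoeffding = 2.0 * math.exp(-2.0 * x.size * eps ** 2 / LAMBDA_MAX ** 2)
print(f"observed frequency = {frequency:.4f}, Hoeffding bound = {hoeffding:.4f}")
```

The clean sequence is held fixed across trials, so the only randomness is the channel noise, matching the semi-stochastic setting of the lemma.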

In preparation of the proof of Theorem |4j we need also the following two Lemmas 



Lemma 8: d(/ y , F xn ®C ) -> a.s. 



Proof: By definition, 



o < d(/£, 



F x „ ® C 



<d(/£,[F x „®C] y ), Vn 



Taking limit n — > oo in the inequality of (1127k we get 



0< lim d(/™, 



F x ™ ®C 



)< lim d(/?,[F x »®C] y ) 

y n — >oo 



= a.s. 



where the second part of the inequality in ( 11271 ) follows from Theorem Q] 
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Lemma 9: d{[F x ^ <g) C] Y , F x ™ ®C ) — ► a.*. 



Proof: 



< d([-F x » ® C]y , 



F x n ®C 



)<d([F x n®C] K) #)+d(/ : 



Kj 



F x « ® C 



We have already seen d([F x n <S> C] Y , fy) ~ * a - s an d by Lemma [S] 



Whence, 



d([F x „®e] y , 



F x n ®C 



)-*■() a.s. 



F x n ®C 



) -^Oa.s. 



We are now ready for the proof of Theorem |4j Proof: [Proof of Theorem |4l We fix n > 1, X n € [a, b] n , 



E 



^ [Yn]m K{U,g(Y)) - E Fxn ® e A(U,g(Y)) 



< 



E 



>^[Yn]®cMU,g(Y)) - E Pxn[Y ^ c A(U,g(Y)) 



E F« n[ Y»]®cMU,9(Y)) - E Fmn9C A(U,g(X)) 



(127) 



Hence, 



Pr sup 

\g-M^[a,b] 



E 



p ^ [Yn] ^ c A(U,g(Y)) - E F ^ c A(U,g(Yj) 



> e + SA max + £AA max + 



Pr 



\(A)(l+U))<Pr(\Ep !cnlYn](SC A(U 1 g(Y))-E Fscn ® c A(U 1 g(Y))\>e)+ (128) 

(\ E Px~\r«]9cHU,g(y)) - E p s,A {Ynm A(U,g(Y))\ > M max + £ A A max + A(A)(1 + &)) (129) 






Now, 



Pr 



E, : 



Ft 



, [Yn]0C A(U,g(Y)) - E Fxn(sc A(U,g(Y)) 



>£ < 



Pr ((|| A || L +A max \\E\\ L +(b - a) || A || L || 5 || L +A raax ) (3 (p^, P x „) > e) (130) 

< Pr ((|| A \\ L +A max || H || L +(b - a) || A \\ L \\ S || L +A raax ) d (.F x » ® C, F x « ® c) > e) 

< e -d-p)^, 
for all nh n >n (C,p,S,K) (131) 



where $\mathcal{C}$ is the family of channel densities $\{f_{Y|x}\}$. The inequality in (130) is due to Lemma 5, while the first inequality in (131) is by application of Theorem 2 and the second inequality is due to Lemma 9 and Theorem 1. Finally, application of Lemma 6 to (129) yields



Pr sup 

\o:R->[a,6] 



£ 



p ^ [yn]m k{U,g{Y)) - E F . n9C A(U,g(y)) 



> e + 6A max + £ A A E 



A(A)(1 + £ A )) ^e^ 1 -"^, for all n > n (C,p, <5, if) 



(132) 



Combining ( I132l > with Lemma [7] gives 



Fr 



1 ™ 

- ^ Afo, g(Y t )) - E p6 ^ c A(U, g(Y)) 



> 2e + 2<5A max + £ A A max + A(A)(1 + £ A ) 
< 2e -G(e+5A max ,A max )n + e -(i-rtB^^ for all n/i n > n (C, p, <S, .ST) (133) 
By the union bound, ( 11331 > guarantees that for any class Q 



( 1 " 

Pr max - V A (a*, 5 (y,)) - P p 

\ «=1 



A A g)CA(c;, 9 (y)) 



> 2e + 25A max + C A A n 



-A(A)(1 + &))<|0| 



2e -G(e+A-A max ,Amax)n + g-(l-p) 2 



(134) 



Consequently, 



p r 



L^ n , 6tA (x n 7 Y n )- min E p s,±„ c A{U, g{Y)) 



+A(A)(l + £ A ))=Pr 



> 2e + 2(5A max + CAA max 
- £ A^^^^n^i)) - E p s^ c A(U,g opt [P!k A [Y n }](Y)) 



i=l 



> 2e + 2JA max + C A A max + A(A)(1 + £ A )) 



< Pr max 



1 ™ 

- £ A( Xi , g(Yi)) - E p .,a 9C A(U, g{Y)) 



i=l 



+ A(A)(i+a))<i^, A | 



> 2e + 2<5A max + C A A max 

2e -G(£+<5A max ,Amax)n + g-(l-p) 1 



(135) 



where the first equality follows from the definition of X n ' > and the fact that for any P 6 Fs,A 

min E P ®cA{U,g(Y)) = E P ® C A(U, g op t[P}(Y)) 
g£Gs,& 



,™ 



The first inequality follows by the fact that P^ A [Y n ] e F S<A and therefore g opt [P*k A [Y n }} e &,. and finally the 



last inequality follows from (1134b . It also follows, from ( 1132b , that 



Pr 



uuu E ps ^ c A(U,g(Y))- min £ F ^ C A([/,G(Y)) 



geSs.A x " 



9ee 5 „ 



> 



+ <5Amax + £ A A max + A(A)(1 + &)) < e-^-") 1 



(136) 



Combining ( 11351 ) and ( 11361 ) gives 



Pr 



£*„,*,* Or", Y")- min £ FxIl0C A(£/, 5 (Y)) 

ff£S<S,A 

2A(A)(1 + £ A ))<|&, A | 

i<5, A 



> 3e + 3<5A max + 2£ A A max + 

2e -G( £ +5A max ,A max )n + e -{l-p)2t£. 



-(I-P) 1 ^ 



(137) 



On the other hand, letting P x '„ denote the element in Ps,A closest (under the Prohorov metric of the corresponding 

measures) to F x ™, 



D (x n )- min E F ^ c A(U,g{Y)) 
sGy.s,A 



min E Fmn9 cA(U, gopt [F](Y))- min E Fxn ®cA{U,g(Y)) 



Ferl 



»,b] 



gBQs.A 



< 



Fer$? M 



E F s,A„ e A(U,g opt [F}(Y))- min E F ^ c A(U,g{Y)) 



geG s „ 



A max £ + £ A A max + A(A)(1 + £ A ) 
p mm 4 E p s,^ c A(U, g op t[P](Y)) - ngi E P ^ e A{U, g{Y)) 
A max <5 + £ A A max + A(A)(1 + £ A ) 



min E^.a c A(U,g(Y))- min E Fxn<s>e A(U,g(Y)) 



A max (5 + ( A Amax + A(A)(1 + £ A ) 

< 2(A max< 5 + £ A A max + A(A)(l + £ A )) 



(138) 

(139) 

(140) 

(141) 
(142) 



where (139) and (142) follow from Lemma 6, and (140) follows from the fact that the achiever of the minimum in the first term of (139) is $P^{\delta,\Delta}_{x^n}$ which, by definition, is a member of $\mathcal{F}_{\delta,\Delta}$. Finally, combining (136) with (142) gives



Pr (|ijfn.i.A(a: n ) K") - £> (ar n )| > 3e + 5<5A max + 4£ A A max + 4A(A)(1 + &)) 



< l&,z 



-G(e+5A max ,A max )« + e -(l-p)Sq- 



-{i- P y 



(143) 



for all nh n > no (C, p, S, K) 



From the definition of Qs,a, it is clear that \Qs.a\ < [f + 1] ■ Hence, 

Pr {\Lxn, s Ax n , Y n ) - D (x n )\ > 3e + 5<5A max + 4£ A A max + 4A(A)(1 + £ A )) 



< 



A 



-G(e+5A mM ,A max )n + e _(l-p)2X_ 



-(I-P) 2 ^ 



for all nh„ > n (C, p, S, K) 
(144)



Appendix VI 
Proof of Lemmas [5] and [6] 
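We need the following proposition for the proof of Lemma 5.

Proposition 1: $A(x) = \int \Lambda(x, g(y)) f_{Y|x}(y)\,dy$ is a bounded Lipschitz function for any measurable $g : \mathbb{R} \to [a, b]$.

Proof: Let $\Delta = |x - x'|$. Then

$$\begin{aligned}
A(x) - A(x') &= \int \Lambda(x, g(y)) f_{Y|x}(y)\,dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\le \int \left( \Lambda(x', g(y)) + \lambda(\Delta, x) \right) f_{Y|x}(y)\,dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\le \int \left( \Lambda(x', g(y)) + \lambda(\Delta, x) \right) \left( f_{Y|x'}(y) + \epsilon(y) \right) dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\le \lambda(\Delta, x) + \Lambda_{\max} \xi_\Delta + \lambda(\Delta, x)\, \xi_\Delta
\end{aligned}$$

Also,

$$\begin{aligned}
A(x) - A(x') &= \int \Lambda(x, g(y)) f_{Y|x}(y)\,dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\ge \int \left( \Lambda(x', g(y)) - \lambda(\Delta, x) \right) f_{Y|x}(y)\,dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\ge \int \left( \Lambda(x', g(y)) - \lambda(\Delta, x) \right) \left( f_{Y|x'}(y) - \epsilon(y) \right) dy - \int \Lambda(x', g(y)) f_{Y|x'}(y)\,dy \\
&\ge -\lambda(\Delta, x) - \Lambda_{\max} \xi_\Delta + \lambda(\Delta, x)\, \xi_\Delta \\
&\ge -\lambda(\Delta, x) - \Lambda_{\max} \xi_\Delta - \lambda(\Delta, x)\, \xi_\Delta
\end{aligned}$$

Hence, $|A(x) - A(x')| \le \lambda(\Delta) + \Lambda_{\max} \xi_\Delta + \lambda(\Delta)\, \xi_\Delta$.

The assumption of Lipschitz continuity (condition C6) of the channel guarantees $\lim_{\Delta \to 0} \xi_\Delta = 0$. With this and the fact that $\lim_{\Delta \to 0} \lambda(\Delta) = 0$, we have

$$\lim_{\substack{\Delta \to 0 \\ |x - x'| < \Delta}} |A(x) - A(x')| = 0$$

Moreover,

$$\begin{aligned}
\|A\|_L &= \sup_{0 < \Delta \le (b - a)}\ \sup_{\substack{x \ne x' \\ |x - x'| = \Delta}} \frac{|A(x) - A(x')|}{|x - x'|} \\
&\le \sup_{0 < \Delta \le (b - a)} \frac{\lambda(\Delta) + \Lambda_{\max} \xi_\Delta + \lambda(\Delta)\, \xi_\Delta}{\Delta} \\
&\le \|\Lambda\|_L + \Lambda_{\max} \|\Xi\|_L + (b - a) \|\Lambda\|_L \|\Xi\|_L
\end{aligned}$$

Hence,

$$\|A\|_{BL} = \|A\|_L + \|A\|_\infty \le \|\Lambda\|_L + \Lambda_{\max} \|\Xi\|_L + (b - a) \|\Lambda\|_L \|\Xi\|_L + \Lambda_{\max}$$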
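The Lipschitz bound of Proposition 1 can be sanity-checked numerically. The sketch below relies on assumptions introduced only for this example: squared-error loss on $[0, 1]$ (so $\|\Lambda\|_L = 2$ and $\Lambda_{\max} = 1$), an additive Gaussian channel, the clipping denoiser, and the reading of $\xi_\Delta$ as the largest $L_1$ distance between $f_{Y|x}$ and $f_{Y|x+\Delta}$, which for Gaussian noise gives $\|\Xi\|_L = \sqrt{2/\pi}/\sigma$. The definition in (38) of the main text may differ in detail.

```python
import math
import numpy as np

A_LO, B_HI, SIGMA = 0.0, 1.0, 0.5   # interval [a, b] and assumed noise level

def averaged_loss(x_value, sigma=SIGMA, num=20001):
    """A(x) = int (x - g(y))^2 f_{Y|x}(y) dy with g(y) = clip(y, a, b)."""
    y = np.linspace(x_value - 10.0 * sigma, x_value + 10.0 * sigma, num)
    dy = y[1] - y[0]
    pdf = np.exp(-0.5 * ((y - x_value) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return float(np.sum((x_value - np.clip(y, A_LO, B_HI)) ** 2 * pdf) * dy)

x_grid = np.linspace(A_LO, B_HI, 201)
a_values = np.array([averaged_loss(v) for v in x_grid])

# Largest finite-difference slope of A on the grid.
max_slope = float(np.max(np.abs(np.diff(a_values)) / np.diff(x_grid)))

loss_lipschitz = 2.0                                  # ||Lambda||_L for squared error on [0, 1]
loss_max = 1.0                                        # Lambda_max
channel_lipschitz = math.sqrt(2.0 / math.pi) / SIGMA  # assumed ||Xi||_L for Gaussian noise
bound = (loss_lipschitz + loss_max * channel_lipschitz
         + (B_HI - A_LO) * loss_lipschitz * channel_lipschitz)

print(f"max slope of A = {max_slope:.4f}, Lipschitz bound = {bound:.4f}")
```

In this example the observed slope sits well below the bound, which is expected since the proof bounds each term in the worst case.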



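Proof: [Proof of Lemma 5]

$$\begin{aligned}
\left| E_{F \otimes \mathcal{C}} \Lambda(U_0, g(Y)) - E_{\tilde{F} \otimes \mathcal{C}} \Lambda(U_0, g(Y)) \right|
&= \left| \int dF(x) \left( \int \Lambda(x, g(y)) f_{Y|x}(y)\,dy \right) - \int d\tilde{F}(x) \left( \int \Lambda(x, g(y)) f_{Y|x}(y)\,dy \right) \right| && (145) \\
&= \left| \int dF(x) A(x) - \int d\tilde{F}(x) A(x) \right| && (146) \\
&= \left| \int A(x)\, d(F - \tilde{F})(x) \right| \\
&\le \|A\|_{BL}\, \beta(P, \tilde{P}) && (147) \\
&\le \left( \|\Lambda\|_L + \Lambda_{\max} \|\Xi\|_L + (b - a) \|\Lambda\|_L \|\Xi\|_L + \Lambda_{\max} \right) \beta(P, \tilde{P})
\end{aligned}$$

where (147) follows from the fact that $A(x)$ is a bounded Lipschitz function, as shown in Proposition 1. Hence, as $\beta(P, \tilde{P}) \to 0$, we have $\left| E_{F \otimes \mathcal{C}} \Lambda(U_0, g(Y)) - E_{\tilde{F} \otimes \mathcal{C}} \Lambda(U_0, g(Y)) \right| \to 0$. ■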
Proof: [Proof of Lemma [6) 

\E P A® c A(U ,g(Y)) - E mc A(U ,g(Y))\ 






iV(A) 

E /<* 



/y|x=„'(y)rfPK)AK,ff(2/)) 



JV(A) 

- E pA (^) 

i=l 
AT(A) 

- E p >*) 



A (a 4 ,g(y)) fY\x=a,(y)dy 



h(<h,g(y)) fY\x=at(y)dy 



(148) 



Equality in (148) is due to application of Fubini's theorem. Hence,

\E P ^ c A(U ,g(Y)) - E P9 cMU ,g(Y))\ 



< 



N(A) 

e h 



i=X 



N{A) 



N(A) 

Iy\x=u> (y)dF(u') (A ( ai , g(y)) + A(A)) ' _ 



AT(A) 
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\ N(A) 

- 5 ^-> (/ 



A (<H,g(y)) fY\x= ai (y)dy 



Y, / dj/(A(a i)5 (j/)) + A(A)) / /y|x=„'(y)^K) - E pA ( a ') / A(a„ 5 ( 2 /))/y|x=a l (2/)dy 



(149) 



< 



JV(A) 



]T dy(A (a*, 3(2/)) + A(A)) (/y| X=a , (y) + e(|,)) / dF(u' 



< 



W ( A ) / ,<* 

E / W) 



JV(A) , 

E pA ^) (J k ^g(y))fY\x=aXv)dy 



k(ai,g(y))f Y \ x = ai (y)dy+ / s(y)A(ai,g(y))dy + X(A) j f Y \x= ai (y)dy 

N(A) 



X(A)je(y)dy- ]T P A ( ai ) (J A( ai ,g(y)) f Ylx=ai (y)dy 



(150) 



< 



JV(A) 

E 



dF(u') 



A(ai,flf(y)) fy\x= ai (y)dy 



Hence, 



- / e(y)A(o 1 ,5(|/))di/ + A(A) / f Y \x=aM d y + A(A) / e(y)cfy 

2V(A) 



iV(A) 

E 



dF(u') 



AT(A) 



YP A { ai )(f A{a t ,g 



ete)A(a i)ff (y))dj/ + A(A) + A(A)£ A 



(y)) fY\x= ai (y)dy 



Y (F(ai) - F(oi-i)) fe(y)A(a i ,g(y))dy + X(A)+X(A)U 



i=l 

N(A) 



< / E e ^) A ( a - 5(y)) ^ A (^)rfy + (A(A) + A(A)a) 
^ »=i 

< £ A A max + (A(A) + A(A)£ A ) 
= CAA max + A(A) (1 + Ca) 



Um|S F A 8C A(f7 ,3(F))-£; i r 8 cA(f7o,3(F))| = 



(151) 



(152) 
Appendix VII
Proof of Theorem 6

In preparation for the proof of Theorem 6, we start by presenting the proofs of Lemma 3 and Theorem 11.

Proof: [Proof of Lemma 3]

1 n—k 

±- £ A(X , 9 (Y* k )) 



n-2k 

i=k+l 
i — k 



= mm 
a 



mm ■ 

g n — 



D k (x n ) = mini; 
a 

— — J2 A(* 4 , 5 ( y ^')) II /r|x=.,(w)dW d53) 

i— k-\-l l=i—k 

i— ^ /A(x 4 , 5 (yi+ fc fe )) f[ /y| X=B ,( W )d W (154) 

-. 2k+l . I — 2fc+i — I -1 

= T¥TT ^ /^r £ A (*i(2fc+i)+fc+i. (155) 

j(2fc+l)+i+2fc 

ff(yj ( ( 2fc+i)+r 2fe )) II fY\x= xl (yi)dyi 

l=j{2k+l)+i 

r n-at-i-l i i 
-. 2/c+l „ 1 I 2T+1 _i 

- m fl in 2FTT 51 / T^^T 51 A (*i(2*+i)+fc+i» d56) 

9 ZH + i ^ J | ^-j. | i=(j 

j(2k+l)+i+2k 

s &$$$?"')) II /^fo)*. 

Z=j(2fc+l)+i 

> — V min / 5j- V Afe, (157) 



j(2fc+l)+i+2fc 

n 

l=j{2k+l)+i 



•i (^jafc+lj+r 2 ;) II fY\x= Xl (yi)dyi 



2fe+l 

1 
2fc 



(158) 



Proposition 1 and Lemmas 5 and 6 extend to their $k$-th-order equivalents, with the proofs carrying over directly from the symbol-by-symbol case. We hence merely state the lemmas for the $k$-th-order case and omit the proofs.

Proposition 2: A(x) — J A [x,g (y^ k )) 11*=-* fY\x i (yi)dy , L k is a bounded Lipschitz function for any measur- 
able g : [a,b] 2k+1 -> M. 
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Lemma 10: For any P, P G .pi"' 6 ]'*, measurable g : R 2fc+1 — > [a,b] and a bounded Lipschitz loss function with 
E fYln A(u,g(Y]! k ))<oo,Vu, 

\E mc A(U , g(Y* k )) - E P9C K{U ,g{Yh k ))\ 

< (|| A || L +A max || 3 ||* +(b a) || A |U|| 5 ||* +A max ) (p p) 

where P and P are the laws associated with P and F and /3 is the usual /3-metric 

|| 3 ||* is the k th order Lipschitz norm of the channel. 

||3||* = sup \- (159) 

0<A<(b-a) £± 

and £a is as defined in ( f38l >. 

Lemma 11: For any A > 0, P G jF[ a > b ]>* with the associated measure P, P A < fc £ P A ' fc , measurable 5 : R 2fe+1 — » 
[a, 6] and a continuous bounded loss function with Ej Y . A(u, g(Y k : k )) < 00, V u , 

\E P ^ c k{U 0l g{Yl k )) - P F8C A([/ , 3 (r\))| < £ A fc +1 A max + A(A) (l + £ A fc+1 ) 



These are then used to bound the deviation of the cumulative loss incurred by the proposed denoiser for each of the $2k + 1$ subsequences from the minimum possible $k$-th-order sliding window loss for that subsequence. We now state the $k$-th-order equivalent of Theorem 4 for each subsequence.

Theorem 11: For all m > 1, k > 1, e > 0, p € (0, 1), S > 0, A > 0, and x' n G [a, 6p fc+1 ) m 

Pr (|L* m ,,,A.* (x m , y m ) - P fe (a; m )| > 3e + 5<5A max + 4£ A fe+1 A max + 4A(A)(1 + £ A fc+1 )) 



< l# 



<5.aI 



,-G(e+<5A max ,A max )m + e _(l- p )^i 



+ e -(i-p)- f i (160) 



for all m/i^ > m k (C, p, <5, K) 



where, 

7fe 



II A || L +A max || 3 HI +(6 - a) || A IUH 3 ||* +A max ) 
and G, ^ A are as defined in Theorem [6] 

Proof: The proof of this theorem carries over directly from the proof of Theorem 4 using Proposition 2 and Lemmas 10, 11 and 7. ■

Proof: [Proof of Theorem 6]

L jtn , s ,^ k {x n ,Y n )-D k {x n ) = 



2fc+l _, 2fc+l 



From Lemma |2 we have 

2fe+l 



w*(* s .n - ^*(* n ) < L jtn , S:& , k (x n ,Y n ) - ^-j jr D k (x n >) 

i—l 

1 2&+1 1 2k+l 



2fc-, 

4=1 1=1 

1 2/c+l 

^ ^^tE^^.a,,^,^)-^^)!] d62) 

i—l 

Hence, 

Pr (L^.^, k (x n ,Y n ) - D k (x n ) > 3e + 55A max + 4^ +1 A max + 4A(A) (l + £^ +1 )) 

/ 2fc+l \ 

" Pr 2fcTT ? I^HAA, fc (^,^"-)-^(^ ! )l >3e + 5,5A max + 4ei fc+1 A max + 4A(A)(l+ei fe+1 ) 

2fc+l 

< E Pr(|^„„,, A , fc (x"-,y"')-^fe(^)l >3e + 5<5A max + 4^ +1 A max + 4A(A)(l + ^ +1 )) 

,-G( e +iA m „,A=,«)^Tf 1 +e -(l-P) 1 Wlf 



<(2fc+l)|^, A | 



g V- 1 W 2(2fc+l) 



This is true by applying Theorem 11 to the $2k + 1$ subsequences of independent supersymbols, with at most $\left\lceil \frac{n}{2k+1} \right\rceil$ supersymbols in each of them. Also, the cardinality of the set of all possible proposed $(2k+1)$-length sliding window denoisers is bounded by the cardinality of the set of all possible quantized $k$-th-order probability mass functions,

e A ' fc ,i.e.,|^ A |<[I + l] A2fc+1 . 

■ 
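The decomposition into $2k + 1$ subsequences used in this proof can be sketched as follows. The indexing convention is an assumption made for illustration: positions are 1-based, window centres run from $k + 1$ to $n - k$, and two centres belong to the same subsequence when they differ by a multiple of $2k + 1$. Within one subsequence the windows of length $2k + 1$ do not overlap, which is what makes the corresponding noisy supersymbols independent over a memoryless channel.

```python
def supersymbol_subsequences(n, k):
    """Group the centres k+1..n-k (1-based) into 2k+1 classes by offset modulo 2k+1."""
    width = 2 * k + 1
    classes = {offset: [] for offset in range(width)}
    for centre in range(k + 1, n - k + 1):
        classes[(centre - (k + 1)) % width].append(centre)
    return classes

def window(centre, k):
    """Set of positions covered by the length-(2k+1) window around centre."""
    return set(range(centre - k, centre + k + 1))

def windows_are_disjoint(centres, k):
    """True if no two windows in the class share a position."""
    seen = set()
    for centre in centres:
        current = window(centre, k)
        if seen & current:
            return False
        seen |= current
    return True

n, k = 23, 2
classes = supersymbol_subsequences(n, k)
for offset, centres in classes.items():
    print(f"subsequence {offset}: centres {centres}, disjoint = {windows_are_disjoint(centres, k)}")
```

With $n = 23$ and $k = 2$, each of the five subsequences holds at most $\lceil 23/5 \rceil = 5$ supersymbols, in line with the count used in the union bound.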

Appendix VIII
Proof of Theorem 8
The following claim is necessary for the proof of Theorem 8.

Claim 1:

$$\lim_{k \to \infty} \min_g E\Lambda(X_0, g(Y_{-k}^k)) = \mathbb{D}(F_X, \mathcal{C})$$

The claim results from the following lemma.

Lemma 12:
• For $k, l \ge 0$, $EU\left( F_{X_0 | Y_{-k}^l} \right)$ is decreasing in both $k$ and $l$.
• For any two unboundedly increasing sequences of positive integers $\{k_n\}$, $\{l_n\}$,

$$\lim_{n \to \infty} EU\left( F_{X_0 | Y_{-k_n}^{l_n}} \right) = EU\left( F_{X_0 | Y_{-\infty}^{\infty}} \right) \tag{163}$$

Equipped with Lemma 12, the proof of Claim 1 is very similar to that of Claim 2 in [36], but we nevertheless present it here for completeness.



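A. Proof of Lemma 12

Proof:
A direct consequence of the definition of the Bayes envelope $U(\cdot)$ is that it is a concave function. Specifically, for two distribution functions $F$ and $G$ defined on $[a, b]$, and $\alpha \in [0, 1]$,

$$\begin{aligned}
U(\alpha F + (1 - \alpha) G) &= \min_{\hat{x} \in [a,b]} \int_{x \in [a,b]} \Lambda(x, \hat{x})\, d(\alpha F + (1 - \alpha) G)(x) \\
&= \min_{\hat{x} \in [a,b]} \int_{x \in [a,b]} \left[ \alpha \Lambda(x, \hat{x})\, dF(x) + (1 - \alpha) \Lambda(x, \hat{x})\, dG(x) \right] \\
&\ge \alpha \min_{\hat{x} \in [a,b]} \int_{x \in [a,b]} \Lambda(x, \hat{x})\, dF(x) + (1 - \alpha) \min_{\hat{x} \in [a,b]} \int_{x \in [a,b]} \Lambda(x, \hat{x})\, dG(x) \\
&= \alpha U(F) + (1 - \alpha) U(G)
\end{aligned}$$

where the first equality follows from the fact that the mapping $F \mapsto Ff$, $Ff = \int f\,dF$, for a bona fide distribution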
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1X\Y' 



i xlY i_+i j dFyi+i 



function, is linear. Next, to show that EU \[F ®C] x , Y i ) decreases with I, observe that 

EU( \r C 1,,,/ i ) = / u([f®c] 

{[F®C] : 
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dF v 
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dF v 



(164) 



where the first inequality follows from the fact that $U$ is a concave functional. The definition of $[F \otimes \mathcal{C}]_{X|y}$ is bona fide from the assumption that the family of conditional measures, $\mathcal{C}$, is absolutely continuous. Finally, application of Fubini's theorem permits the change of order of integration to achieve the final inequality. The fact that



EU ( [F <g> C) 



X\Y l _ 



decreases with k is established similarly, concluding the proof of the first item. For the second 



item, similar to the proof of Lemma 4 in [36], by the martingale convergence theorem, we have, F x , y i. 
a.s., implying F^ 



X\Y° 



F 



X\Y% 



f 



X\Y_ 



X\Y°° 



Fx\y°° ■ Using the convergence of random measures [20, Theorem 16.16], we have 



/, Vf € C^, the class of continuous positive valued functions with compact support. Here, the 



notation Ff = J fdF for any measurable / and bona fide probability distribution function, F. In section|IV] we have 
imposed the condition of continuity of the loss function, A, and since the input alphabet space is restricted to a closed 
compact interval [a, 6], we satisfy the condition, A € C^. Hence, we have, F x , Y i n A(-,x) — » F X \y™ A (•,£), 
V, x. Since A(-,x) : [a,b] x [a, b] — > R + is a continuous mapping, in x, min^ e [ a a J A (x,x) dF(x) is also 
a continuous mapping. Using the fact that A is a bounded mapping and the continuous mapping theorem [12], 



U(F 



x\YL\ r 



U 



\F X \y^\ 



and EU F 



xy'i 



EU[F xlY ~ 
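The concavity of the Bayes envelope used in the proof of Lemma 12 is easy to verify numerically for discrete distributions. The sketch below makes illustrative assumptions that are not part of the paper: absolute-error loss, distributions supported on eight random points of $[0, 1]$, and a finite grid of candidate reconstructions. Since the minimum is then taken over finitely many linear functions of the distribution, concavity holds exactly up to floating-point error.

```python
import numpy as np

def bayes_envelope(points, probs, candidates):
    """U(F) = min over x_hat of sum_i probs[i] * |points[i] - x_hat| (absolute loss)."""
    losses = np.abs(points[None, :] - candidates[:, None])   # shape (candidates, points)
    return float(np.min(losses @ probs))

rng = np.random.default_rng(0)
points = np.sort(rng.uniform(0.0, 1.0, size=8))   # common support of F and G
candidates = np.linspace(0.0, 1.0, 101)           # grid of reconstruction values
F = rng.dirichlet(np.ones(points.size))
G = rng.dirichlet(np.ones(points.size))

gaps = []
for alpha in np.linspace(0.0, 1.0, 11):
    mixture = alpha * F + (1.0 - alpha) * G
    lhs = bayes_envelope(points, mixture, candidates)
    rhs = (alpha * bayes_envelope(points, F, candidates)
           + (1.0 - alpha) * bayes_envelope(points, G, candidates))
    gaps.append(lhs - rhs)
    print(f"alpha = {alpha:.1f}: U(mix) - mix of U = {lhs - rhs:.6f}")
```

The gap is nonnegative for every $\alpha$ and vanishes at the endpoints, which is the concavity property that, combined with Jensen's inequality, makes the expected Bayes envelope decrease as more observations are conditioned on.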
B. Proof of Claim 1

Proof: [Proof of Claim 1]

B(Fx»,C) = 



mm 



1 " 
SL^„ (X", F") = - Y^ min ^ A [Xi,£ ( Yn ) 






0,6] 



= - V / min £ [A (Xi, x)\Y n = y n ] dF Y ~ 

nf^J Rn xe[a,b] 

1 " r 

= "E/ W(^ 4 |r»=„»)diV- 

1 n 1 n 



(165) 



i=l i=l 

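where the last equality follows by stationarity. Since by Lemma 12 $E\,U\big(F_{X_0|Y_{1-i}^{n-i}}\big) \ge E\,U\big(F_{X_0|Y_{-\infty}^{\infty}}\big)$, it follows
from (165) that $\mathbb{D}(F_{X^n}, C) \ge E\,U\big(F_{X_0|Y_{-\infty}^{\infty}}\big)$ for all $n$ and, therefore, $\mathbb{D}(F_X, C) \ge E\,U\big(F_{X_0|Y_{-\infty}^{\infty}}\big)$. On the other
hand, for any $k$, $0 \le k \le n$, Lemma 12 and (165) yield the upper bound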

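\[
\begin{aligned}
\mathbb{D}(F_{X^n}, C) &\le \frac{1}{n}\Big[2k\,U(F_{X_0}) + \sum_{i=k+1}^{n-k} E\,U\big(F_{X_0|Y_{1-i}^{n-i}}\big)\Big] && (166) \\
&\le \frac{1}{n}\Big[2k\,U(F_{X_0}) + \sum_{i=k+1}^{n-k} E\,U\big(F_{X_0|Y_{-k}^{k}}\big)\Big] && (167) \\
&= \frac{1}{n}\Big[2k\,U(F_{X_0}) + (n-2k)\,E\,U\big(F_{X_0|Y_{-k}^{k}}\big)\Big] && (168)
\end{aligned}
\]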



Considering the limit as $n \to \infty$ of both ends of the above chain yields $\mathbb{D}(F_X, C) \le E\,U\big(F_{X_0|Y_{-k}^{k}}\big)$. Letting now
$k \to \infty$ and invoking Lemma 12 implies $\mathbb{D}(F_X, C) \le E\,U\big(F_{X_0|Y_{-\infty}^{\infty}}\big)$. $\blacksquare$
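The limiting step can be spelled out by rearranging the right-hand side of (168). This is only a worked restatement of the bound; it relies on $U(F_{X_0})$ being finite, which holds because the loss is bounded.

```latex
\begin{aligned}
\frac{1}{n}\Big[2k\,U(F_{X_0}) + (n-2k)\,E\,U\big(F_{X_0|Y_{-k}^{k}}\big)\Big]
 &= E\,U\big(F_{X_0|Y_{-k}^{k}}\big)
    + \frac{2k}{n}\Big[U(F_{X_0}) - E\,U\big(F_{X_0|Y_{-k}^{k}}\big)\Big] \\
 &\xrightarrow[n \to \infty]{} E\,U\big(F_{X_0|Y_{-k}^{k}}\big)
    \quad \text{for each fixed } k .
\end{aligned}
```

The correction term is of order $k/n$, so the edge effects from the first and last $k$ symbols vanish in the limit.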



C. Proof of Theorem \E\ 

Proof: By definition of $\mathbb{D}(F_X, C)$, clearly

\[
\liminf_{n \to \infty} E\,L_{\hat{X}^{n}_{univ}}(X^n, Y^n) \ge \mathbb{D}(F_X, C).
\]

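On the other hand, from (45), for any $k$,

\[
\begin{aligned}
E\,D_k(X^n) &= E \min_{g} E_{F^{k}_{X^n} \otimes C}\, \Lambda\big(U_0, g(Y_{-k}^{k})\big) \\
&\le \min_{g} E\Big[E_{F^{k}_{X^n} \otimes C}\, \Lambda\big(U_0, g(Y_{-k}^{k})\big)\Big] \\
&= \min_{g} E\,\Lambda\big(X_0, g(Y_{-k}^{k})\big)
\end{aligned}
\tag{169}
\]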



where, on the right side, $X_{-k}^{k}$ is emitted from the (unique) double-sided extension of the source $F_X$. Using the result
from equation (169), we get

\[
\limsup_{n \to \infty} E\,D_{k_n}(X^n) \le \mathbb{D}(F_X, C)
\tag{170}
\]

implying, by Theorem 7 and bounded convergence, that

\[
\limsup_{n \to \infty} E\,L_{\hat{X}^{n}_{univ}}(X^n, Y^n) \le \mathbb{D}(F_X, C)
\tag{171}
\]

and proving (63). To prove (64), assume stationary ergodic $X$. We have established the continuity of $E_{F \otimes C}\,\Lambda(U_0, g(Y))$
w.r.t. $F \in \mathcal{F}^{[a,b]}$ in Lemma 5, and it is easily extendible to $\min_g E_{F \otimes C}\,\Lambda(U_0, g(Y))$. By the ergodic theorem and
continuity of $\min_g E_{F \otimes C}\,\Lambda(U_0, g(Y))$ in $F \in \mathcal{F}^{[a,b]}$, it follows from the representation in (45) that

\[
D_k(X) = \lim_{n \to \infty} D_k(X^n) = \min_{g} E\,\Lambda\big(X_0, g(Y_{-k}^{k})\big) \quad \text{a.s.}
\tag{172}
\]

and by Claim 1

\[
D(X) = \mathbb{D}(F_X, C) \quad \text{a.s.}
\tag{173}
\]

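Thus, the fact that $\limsup_{n \to \infty} D_{k_n}(x^n) \le D(x)$, $\forall x \in [a, b]^{\infty}$ (recall proof of Corollary 1), combined with Theorem 7
implies

\[
\limsup_{n \to \infty} L_{\hat{X}^{n}_{univ}}(X^n, Y^n) \le \mathbb{D}(F_X, C) \quad \text{a.s.}
\tag{174}
\]

On the other hand, by Fatou's lemma and the definition of $\mathbb{D}(F_X, C)$,

\[
E\Big[\limsup_{n \to \infty} L_{\hat{X}^{n}_{univ}}(X^n, Y^n)\Big]
\ge \limsup_{n \to \infty} E\,L_{\hat{X}^{n}_{univ}}(X^n, Y^n) \ge \mathbb{D}(F_X, C)
\tag{175}
\]

The combination of (174) and (175) completes the proof of (64). $\blacksquare$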

Appendix IX 
Comparison to the denoiser in [5] 

Referring to Fig. [5], each output alphabet is uniformly quantized to the same number of levels, $M$, as the input
(for $Y \in \mathbb{R}$, the end-intervals are greater than the quantization step size). We label the set of quantization intervals at
the output as $O = \{O_1, \ldots, O_M\}$ and let the quantization step size be $a$. Let $Z^n$ denote the quantized version of the
channel output $Y^n$. Also, let $\mathcal{A}$ denote the $M$-level finite alphabet set at the input.

As a result of the quantization, we propose mapping the $k$-th order kernel density estimate at the output, $f_Y^{k}$,
to the corresponding probability mass function, $Q^{k}_{n}$, with mass at the quantized output alphabets in the following
manner,

\[
Q^{k}_{n}[y^n](v_{-k}^{k}) = \int_{y_{-k}^{k} \in v_{-k}^{k}} f_Y^{k}(y_{-k}^{k})\, dy_{-k}^{k}
\tag{176}
\]

where $v_{-k}^{k}$ is the corresponding $(2k+1)$-tuple of the quantized levels. The channel conditional densities also get
correspondingly mapped to an $M \times M$ channel matrix that is formed using

\[
\Pi(i, j) = \int_{y:\, Q_a(y) = j} f_{Y|x=i}(y)\, dy
\tag{177}
\]

where $Q_a(\cdot)$ denotes a uniform quantizer with a quantization step size $a$.
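The mapping (177) from channel densities to an $M \times M$ matrix can be sketched as follows. This is an illustrative sketch under assumptions not fixed by the text: additive Gaussian noise with standard deviation `sigma`, equally spaced input levels, and quantization cells centered on the levels with unbounded end cells. The function names are hypothetical.

```python
import math


def gauss_cdf(t):
    """Standard normal CDF; math.erf handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def channel_matrix(levels, step, sigma):
    """Pi[i][j] = P(quantized output falls in cell j | input level i).

    Assumption: Y = X + N with N ~ Normal(0, sigma^2). Cell j is centered on
    levels[j]; the two end cells extend to -infinity and +infinity.
    """
    m = len(levels)
    edges = [-math.inf] + [levels[j] + step / 2 for j in range(m - 1)] + [math.inf]
    matrix = []
    for x in levels:
        row = [
            gauss_cdf((edges[j + 1] - x) / sigma) - gauss_cdf((edges[j] - x) / sigma)
            for j in range(m)
        ]
        matrix.append(row)
    return matrix


# Hypothetical example: three input levels, unit step, sigma = 0.5.
Pi = channel_matrix([0.0, 1.0, 2.0], step=1.0, sigma=0.5)
for row in Pi:
    print([round(p, 4) for p in row])
```

Each row is a probability vector because the cells partition the real line, which is what makes the result a valid channel matrix for a finite-alphabet denoiser.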

We compare $Q^{k}_{n}[y^n](v_{-k}^{k})$ to $P^{k}_{z^n}(v_{-k}^{k})$, the $k$-th order distribution of the quantized output symbols, using the
notation in [5].

e (A) = r -^^ (178) 
The density estimate, $f_Y^{k}$, we consider is the cubic histogram estimate. The histogram estimate is defined by

\[
f_Y^{k}(y) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbf{1}\{Y_i \in A_{nj}\}}{\lambda(A_{nj})}, \quad y \in A_{nj},\ y \in \mathbb{R}^{2k+1}
\tag{179}
\]

where $\mathcal{P}_n = \{A_{nj}, j = 1, 2, \ldots\}$, $n \ge 1$, is a sequence of partitions and the $A_{nj}$'s are Borel sets with finite nonzero
Lebesgue measure. The sequence of partitions is rich enough such that the class of Borel sets ($\mathcal{B}^{[a,b]}$) is equal to

\[
\bigcap_{n=1}^{\infty} \sigma\Big(\bigcup_{m=n}^{\infty} \mathcal{P}_m\Big)
\tag{180}
\]

where $\sigma$ is the usual notation of the $\sigma$-algebra generated by a class of sets. In particular, the cubic histogram
estimate is constructed when we consider sets $A_{nj}$ of the form $\prod_{i=1}^{d} [a_i k_i h, a_i (k_i + 1) h)$, where the $k_i$'s are integers, $h$
is a smoothing factor as for the kernel density estimate in (179), and the $a_i$'s are positive constants s.t. $a_i k_i h \in [a, b]$,
$\forall h, k_i$. The following result, similar to that in Theorem 1 for $J_n$ defined in equation (25), holds for histogram
density estimates.

Theorem 12: Assume that the sequence of partitions $\mathcal{P}_n$ satisfies (180). Consider

1) $J_n \to 0$ in probability as $n \to \infty$, for all sequences $x^n$

2) $J_n \to 0$ almost surely as $n \to \infty$, for all sequences $x^n$

3) $J_n \to 0$ exponentially as $n \to \infty$, for all sequences $x^n$

4) For all $A \in \mathcal{B}$ with $0 < \lambda(A) < \infty$, and all $\epsilon > 0$, there exists $n_0$ such that for all $n \ge n_0$, we can find
$A_n \in \sigma(\mathcal{P}_n)$ with $\lambda(A \,\Delta\, A_n) < \epsilon$ and

\[
\sup_{M > 0,\ \text{all sets } C \text{ of finite Lebesgue measure}} \ \limsup_{n \to \infty}\ 
\lambda\Big(\bigcup_{j:\, \lambda(A_{nj} \cap C) \le M/n} A_{nj} \cap C\Big) = 0
\tag{181}
\]

It is then true that $4 \Rightarrow 3 \Rightarrow 2 \Rightarrow 1$.



For the proof of this theorem, refer to [7] with the added condition of tightness imposed on the family of measures
associated with the channel, $C$.

The condition 4) in Theorem 12 translates to $\lim_{n \to \infty} h = 0$, $\lim_{n \to \infty} n h^{d} = \infty$. It can be shown as in [7] that
these are necessary and sufficient conditions for that specified in 4) in Theorem 12. By choosing the smoothing factor,
$h$, to be a decreasing sequence of numbers that are all integer fractions of the quantization step size $a$, such that
$n h^{d} \to \infty$ is also simultaneously satisfied, we get the mapping in equation (176) to reduce to that in equation (178)
for the subsequences described in Section V. This is because we split the sequence $x^n$ into $2k + 1$ subsequences
whose $(2k+1)$-length super symbols are independent, so that we can apply Theorem 12. Now,

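\[
\begin{aligned}
Q^{k}_{n}(v_{-k}^{k}) &= \int_{y_{-k}^{k} \in v_{-k}^{k}} f_Y^{k}(y_{-k}^{k})\, dy_{-k}^{k}, \qquad v_{-k}^{k} \in O^{2k+1} && (182) \\
&= \int_{y_{-k}^{k} \in v_{-k}^{k}} \frac{1}{\left\lceil \frac{n-2k-i-1}{2k+1} \right\rceil}
   \sum_{j=0}^{\left\lceil \frac{n-2k-i-1}{2k+1} \right\rceil - 1}
   \frac{\mathbf{1}\big\{Y_{j(2k+1)+i}^{\,j(2k+1)+i+2k} \in A_{nl}\big\}}{\lambda(A_{nl})}\, dy_{-k}^{k} && (183) \\
&= \frac{1}{\left\lceil \frac{n-2k-i-1}{2k+1} \right\rceil}\, r\big[z^{n,i}, v_{-k}^{k}\big] && (184)
\end{aligned}
\]

where, in (183), $A_{nl}$ is the partition cell containing $y_{-k}^{k}$.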
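The reduction in (182)-(184) can be illustrated in one dimension. The sketch below is an assumption-laden toy (scalar samples, i.e. $k = 0$, bins aligned with the quantization cell edges, hypothetical sample values): when the bin width `h` is an integer fraction of the quantization step, integrating the histogram density over a quantization cell returns the fraction of samples in that cell, which is the empirical distribution of the quantized symbols.

```python
import math
from collections import Counter


def histogram_density(samples, h):
    """Histogram estimate: bin b = [b*h, (b+1)*h) has density count / (n * h)."""
    n = len(samples)
    counts = Counter(math.floor(y / h) for y in samples)
    return {b: c / (n * h) for b, c in counts.items()}


def cell_mass(density, h, step, cell):
    """Integrate the histogram over the cell [cell*step, (cell+1)*step).

    Assumes step / h is an integer, so each cell is a union of whole bins.
    """
    bins_per_cell = round(step / h)
    first = cell * bins_per_cell
    return sum(density.get(first + r, 0.0) * h for r in range(bins_per_cell))


# Hypothetical data: quantization step 1.0, bin width 0.25 (four bins per cell).
samples = [0.1, 0.3, 0.6, 0.9, 1.2, 1.7, 2.4, 0.55]
step, h = 1.0, 0.25
density = histogram_density(samples, h)

masses = [cell_mass(density, h, step, c) for c in range(3)]
quantized = Counter(math.floor(y / step) for y in samples)
empirical = [quantized[c] / len(samples) for c in range(3)]
print(masses, empirical)
```

Both lists agree because the factor $h$ from integrating each bin cancels the $1/h$ in the density, leaving only counts divided by the sample size.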



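If we mapped the finite input-continuous output channel, $C$, to $\Pi$, the mapping in equation (48) would then reduce
to

\[
\arg\min_{P \in \mathcal{F}^{\mathcal{A},k}} \sum_{v_{-k}^{k}} \Big| Q(v_{-k}^{k}) - \sum_{u_{-k}^{k}} \prod_{j=-k}^{k} \Pi(u_j, v_j)\, P(u_{-k}^{k}) \Big|
\tag{185}
\]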



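where $\mathcal{F}^{\mathcal{A},k}$ denotes the space of all possible $k$-th order distributions on $\mathcal{A}$. If we lift the constraint of the minimizer
being a bona fide element of $\mathcal{F}^{\mathcal{A},k}$, we get the following candidate for the minimizer in (185)

\[
\frac{1}{\left\lceil \frac{n-2k-i-1}{2k+1} \right\rceil} \sum_{v_{-k}^{k}} r\big[z^{n,i}, v_{-k}^{k}\big] \prod_{j=-k}^{k} \Pi^{-1}(v_j, u_j)
\tag{186}
\]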



which is exactly the same as $\hat{P}_{x^{n,i}}[z^{n,i}](u_{-k}^{k})$ using equation (18) in [5], also given below.

\[
\hat{P}_{x^{n,i}}[z^{n,i}](u_{-k}^{k}) = \frac{1}{\left\lceil \frac{n-2k-i-1}{2k+1} \right\rceil} \sum_{v_{-k}^{k}} r\big[z^{n,i}, v_{-k}^{k}\big] \prod_{j=-k}^{k} \Pi^{-1}(v_j, u_j)
\tag{187}
\]

Now, using the construction of the discrete denoiser in equation ( TSUI) , for $Q_{x^{n,i}}$, we get



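\[
\begin{aligned}
g_{opt}[Q_{x^{n,i}}](y_{-k}^{k}) &= \arg\min_{\hat{x} \in \mathcal{A}} \Lambda(\cdot, \hat{x})^{T} \big[Q_{x^{n,i}} \otimes C\big]_{U_0 | y_{-k}^{k}} \\
&= \arg\min_{\hat{x} \in \mathcal{A}} \sum_{a \in \mathcal{A}} \Lambda(a, \hat{x})
   \sum_{u_{-k}^{k} \in \mathcal{A}^{2k+1}:\, u_0 = a}\ \prod_{j=-k}^{k} f_{Y|x=u_j}(y_j)\, Q_{x^{n,i}}\big(U_{-k}^{k} = u_{-k}^{k}\big)
\end{aligned}
\tag{188}
\]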



which is exactly the same as $g_{opt}[\hat{P}](y_{-k}^{k})$ in equation (16) in [5]. Hence, the proposed denoiser with histogram
density estimate of the output symbols and quantization gives us the same denoising rule as that of [5] applied to
the $2k + 1$ subsequences of the output sequence $Y^n$.
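The two stages compared above, channel inversion as in (186)-(187) followed by the Bayes rule as in (188), can be sketched in the simplest possible setting. The code below is a toy illustration, not the scheme of [5] or of this paper: it assumes a binary input alphabet, context length $k = 0$, Hamming loss, a hypothetical $2 \times 2$ channel matrix, and it uses the quantized channel matrix in place of the conditional densities. All function names are hypothetical.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def estimate_input_pmf(q, pi):
    """Channel inversion: q = p * Pi (row vectors), hence p = q * Pi^{-1}."""
    inv = invert_2x2(pi)
    return [sum(q[v] * inv[v][u] for v in range(2)) for u in range(2)]


def denoise_symbol(v, p_hat, pi, levels, loss):
    """Pick x_hat minimizing sum_a loss(a, x_hat) * p_hat(a) * Pi(a, v)."""
    def risk(x_hat):
        return sum(
            loss(a, x_hat) * p_hat[u] * pi[u][v] for u, a in enumerate(levels)
        )
    return min(levels, key=risk)


def hamming(a, x_hat):
    return 0.0 if a == x_hat else 1.0


# Hypothetical numbers: Pi[u][v] = P(quantized output v | input u).
levels = [0, 1]
pi = [[0.9, 0.1], [0.2, 0.8]]
q = [0.865, 0.135]  # empirical pmf of quantized outputs (toy values)

p_hat = estimate_input_pmf(q, pi)
rule = {v: denoise_symbol(v, p_hat, pi, levels, hamming) for v in (0, 1)}
print(p_hat, rule)
```

With these toy numbers the inverted estimate puts about 0.95 of the mass on input 0, so the rule maps both observed symbols to 0: the estimated prior outweighs the channel evidence for a lone observed 1.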
[Figure 5: image panels omitted. RMSE labels extracted with the panels, in order: 14.7354, 13.0945, 11.2899, 11.2610, 11.1782, 7.842]

Fig. 5. Row 1- left: Original image, right: Noisy image, $\sigma = 20$; Denoised images using, Row 2- left: k = 1, right: k = 2; Row 3- left:
k = 4, right: k = 6; Row 4- left: the scheme in [9], right: the scheme in [26]

Fig. 6. Comparison of RMSE of the denoised image for various context lengths, k

Fig. 7. Row 1- left: Original image, right: Noisy image; Denoised images using Row 2- left: symbol-by-symbol scheme, right: Comparison of
distribution estimates for the symbol-by-symbol denoiser

Fig. 8. Row 1- left: Original image, right: Noisy image; Denoised images using Row 2- left: proposed scheme, right: BLS-GSM [26]



